EduG Result Interpretation: Generalizability Theory in OSCE

In my previous article, I did not write about how to interpret the results of the EduG calculation. Therefore, in this section, I will try to explain the steps needed to interpret them.

But before we start, let me share the “fun-statistic” warnings from the books and articles I read. The introductions to Generalizability Theory (GT) mostly mention that, in order to study GT, a strong background in Analysis of Variance (Anova) is necessary. What the …??? I want to learn GT, and basically they tell us: hey, do not continue reading this book unless you understand Anova inside and out.

The other interesting warning is in the interpretation section: there is no fixed rule on how to judge the results in a GT table. Another “whaat” moment???

Later in my learning process, I realized that deeply understanding Anova and the concept behind GT, and analyzing many data sets, is the only way to be able to read and interpret GT results. I will try to keep it simple in this article, which is specific to the Objective Structured Clinical Examination (OSCE) and, more specifically, to the scenario from the article I mentioned above.

“An OSCE with 10 stations, conducted in 3 different locations. This OSCE is administered in one day only, starting in the morning and finishing in the evening. There is a small break every 2 cycles, and a full break in the mid-afternoon for lunch. In our school's clinical rotation program, 4 big departments run their OSCE assessments on the same day. In this scenario I took data from 2 departments only.”

*Note: the OSCE terms used in this article refer to “The Objective Structured Clinical Examination (OSCE): AMEE Guide No. 81. Part II: Organisation & Administration”.

Going back to Anova for a moment: a GT analysis with 2 facets is the same as classical test theory (CTT), so when learning Anova we should most likely move on to factorial Anova. Besides factorial Anova, multi-level Anova is important to learn too. The equivalent terms in GT are facet (for a source of variance) and nested (for multi-level). The concept of mixed-design Anova also matters for GT analysis.

Understanding the decisions made in the calculation, as well as the exam conditions, is important, so we will review them one by one.

Determining Possible Sources of Error

The first thing to do in GT is to imagine your sources of error. After determining the sources, we have to look back at our data and check whether they match the data properties, in other words how completely the data were saved in our OSCE. Based on these 2 factors, even if we can guess a possible source of error, the error cannot be calculated if we do not have the information. For example, if we want to calculate the effect of location as a possible source of error but never saved the location information in each student's data, then we cannot do that in GT.
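A quick way to check this before opening EduG is to scan the saved scores for the information each facet needs. The sketch below is only an illustration: the field names (student, location, department, station, score) and the two example rows are my assumptions, not the real file layout of this OSCE.

```python
# Hypothetical field names; adapt them to however your OSCE scores are stored.
REQUIRED_FIELDS = {"student", "location", "department", "station", "score"}


def missing_facet_fields(records, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in at least one record."""
    missing = set()
    for row in records:
        for field in required:
            if row.get(field) in (None, ""):
                missing.add(field)
    return missing


# Made-up example rows: the second one was saved without its location.
records = [
    {"student": "S001", "location": "A", "department": "A", "station": 1, "score": 62},
    {"student": "S002", "location": "", "department": "A", "station": 1, "score": 55},
]

print(missing_facet_fields(records))
```

If the returned set is not empty, the facets that depend on those fields cannot be analyzed until the information is recovered.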

In this scenario, the students are unique to the exam location. Why? Because each OSCE circuit differs in:

  1. Location, of course, which means the building and room conditions.
  2. Standardized patients (SPs), and
  3. The examiners

Since we only have 1 examiner for each station, the examiner most likely becomes one of the sources of measurement error. Remember that increasing the number of examiners is one of many ways to improve exam reliability. We will not consider the station design, including the rubric, as a source of error, since all circuits have the same settings.

Based on the facts above, location is one possible source of error, and the students are therefore nested in locations. In Anova, this condition is designed as a level in the location variance. At this point we already have 2 sources of error, students and locations; in other words, 2 facets are already defined.

Are those all the facets, or sources of error? No. As you can see above, the scenario mentions that this OSCE is managed by 2 departments. The OSCE settings tell us that each department has 5 stations, making the OSCE circuit 10 stations in total. In this condition, the possible sources of error are the stations and the departments themselves. Since stations are owned by departments, the stations are nested within departments. Or, in Anova, we can see it as a level in the station variance. The stations are unique to each department.

Could the rubric be a source of error in this scenario? Yes, it is possible, as long as you remember the principle of the analysis: you need to have the data to analyze it. If you keep the students' performance in each station per item, not only the total score per student per station, then it is possible. In this scenario, we do not have that information.

So, to summarize this section, this OSCE has 4 facets, with 2 facets nested in the other 2.
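The nesting described above can also be verified directly in the data. Here is a minimal sketch, assuming one record per student per station; the field names, the station labels (such as "A-1") and the rows are made up for illustration.

```python
from collections import defaultdict


def is_nested(records, inner, outer):
    """True if every level of `inner` occurs under exactly one level of `outer`."""
    seen = defaultdict(set)
    for row in records:
        seen[row[inner]].add(row[outer])
    return all(len(outers) == 1 for outers in seen.values())


# Made-up rows: two students, each sitting one station from each department.
records = [
    {"student": "S001", "location": "A", "department": "A", "station": "A-1"},
    {"student": "S001", "location": "A", "department": "B", "station": "B-1"},
    {"student": "S002", "location": "B", "department": "A", "station": "A-1"},
    {"student": "S002", "location": "B", "department": "B", "station": "B-1"},
]

print(is_nested(records, "student", "location"))    # students within locations
print(is_nested(records, "station", "department"))  # stations within departments
print(is_nested(records, "station", "location"))    # same stations run in every location
```

With these rows the first two checks come out true and the last one false, which matches the design: students and stations are each nested, while the same stations appear in every location.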

Determining the Level of the Facets

This is why we need background knowledge of Anova, especially mixed design: we have to decide which factor is a fixed effect and which is a random effect. In some analysis designs, the choice between random and fixed effects makes a huge difference in the results, for example when you use the Standard Error of Measurement (SEM) to calculate the Smallest Detectable Difference (or Smallest Detectable Change).
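To see why the SEM matters so much here, this is a small sketch of one commonly used definition of the Smallest Detectable Change, SDC = 1.96 × √2 × SEM (about 95% confidence). The function name is mine, and the SEM values are the per-building ones reported later in this article; any change in the SEM, for example from a different fixed or random choice, is multiplied by roughly 2.77 in the SDC.

```python
import math


def smallest_detectable_change(sem, z=1.96):
    """SDC for a given SEM; z = 1.96 corresponds to about 95% confidence."""
    return z * math.sqrt(2) * sem


# SEM values reported for Buildings A, B and C later in this article.
for building, sem in {"A": 4.78, "B": 2.98, "C": 4.96}.items():
    print(building, round(smallest_detectable_change(sem), 2))
```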

To keep this setting easy to learn, and to describe this exam condition as well as possible, we choose a fixed effect for all variables; in EduG, you define the level of a facet by typing “INF”. Put simply, we treat the exam as part of a population of exams in the world, rather than treating this OSCE as the only exam in the world. However, as you can see in the D-study, one of the settings is a random effect; in this case we want to see the difference between fixed and random effects. Depending on the level of the facets, we can sometimes guess whether the error comes from a facet or not. Should that error be ignored or not? That depends on the condition too.

Note: since statistics is a daunting subject for people in med school, including me, the easiest way to understand statistics is to learn from a guru and ask them to show a graph of every phenomenon. It is easier to remember a visual than to try hard to understand a formula. YouTube can be a very helpful source; you could try it. For example, when I was learning one-way through factorial Anova, many lessons on YouTube provided interactive explanations. That is easy enough for people with “numbers phobia”, even though it takes time.

Now, we can start to analyze.

Remember, it will not be as straightforward as reading Anova, where a statistically significant difference is the one whose significance value is below the alpha. Reading the G-table is not as simple as that.

First, the G coefficient (G-coef) in this scenario is 0.71. At this point, we should understand that the result is quite good but not good enough. At this stage, we have around 30% error.
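Where does the "around 30%" come from? In general terms, a G coefficient is the share of score variance that reflects real differences between students, so the rest (1 minus G) is attributed to error. A minimal sketch follows; the variance values are made up and chosen only so that the ratio matches the 0.71 above.

```python
def g_coefficient(true_variance, error_variance):
    """Share of score variance that comes from real differences between students."""
    return true_variance / (true_variance + error_variance)


def error_share(g_coef):
    """Share of score variance attributed to error."""
    return 1.0 - g_coef


# Made-up variances, chosen only so that the ratio matches the reported 0.71.
print(round(g_coefficient(71.0, 29.0), 2))
print(round(error_share(0.71), 2))
```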

The question is: among those facets, which one is the primary source of error? Imagine a cake; this cake is our error. Which part is the biggest chunk?

One of the best features of GT is the ability to pinpoint the exact location of the error and quantify it. Hence, we can fix the error and have a better exam in the future. CTT cannot do this. If we do not know the exact location, it is like trying to walk blindfolded: we cannot fix the problem, or we may fix something unnecessary.
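Cutting the cake simply means turning each variance component into a percentage of the total. The sketch below shows the arithmetic only. The component labels follow the notation of this article (S students, L locations, O stations, D departments), the full list of eight components is my assumption about the design, and the numbers are hypothetical, NOT the results of this OSCE.

```python
def variance_percentages(components):
    """Convert variance components to percentages of the total variance.

    Negative estimates are treated as zero, a common convention in GT.
    """
    corrected = {name: max(value, 0.0) for name, value in components.items()}
    total = sum(corrected.values())
    return {name: 100.0 * value / total for name, value in corrected.items()}


# Hypothetical values for illustration only; NOT the results of this OSCE.
components = {
    "S:L": 4.0, "L": 0.2, "D": 0.4, "O:D": 2.0,
    "LD": 1.6, "LO:D": 2.4, "SD:L": 1.0, "SO:LD": 8.4,
}

percentages = variance_percentages(components)
# Print the components from the biggest slice to the smallest.
for name, pct in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:6s} {pct:5.1f}%")
```

Sorting from the largest to the smallest percentage makes the biggest chunk of the cake the first line of the output.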

Back to the result: the G table shows quite a high percentage in LD and LO:D. LD is the interaction of location and department; in other words, there is a problem in the circuit at a certain location. LO:D, location by observation (station) nested in department, further tells us that the problem is in the stations of one of the departments.

The biggest percentage is SO:LD, which adds the information that the student interaction with LO:D (the problematic one) creates a bigger problem. The student is not the problem, since the problem arises only in the interaction with LO:D (stations in one of the departments, at a certain location).

Which location and which department?
We need to calculate the G-coef independently, using level reduction. The results for Buildings A, B, and C respectively are as follows:

  • (G-coef) 0.80, 0.80, and 0.63
  • (SEM) 4.78, 2.98, and 4.96

It is clear that the source of error comes from Building C, since its G-coef is the lowest compared to Buildings A and B. Next, let us look at the information from the Means calculation.

The Grand Mean is 58.3. We should expect the mean in every location to be around that value. Why? Because student placement is random, based on the student ID number, and that number is based on the time of first registration. Therefore, we should expect the students to be distributed equally in terms of capability. There is no grouping based on Entrance Exam scores.

In the graph above, location 3 (Building C) is the one that raises the alarm. Further down, we can see that Department A in Building C gives a higher mean compared to the other places. These two pieces of evidence, the G table and the Means, point us to the exact location of the error: Department A in Building C.
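The same comparison of means can be reproduced outside EduG. This is a minimal sketch, assuming one record per score; the scores and the tolerance of 5 points are made up for illustration and are not the data behind the Grand Mean of 58.3.

```python
from collections import defaultdict
from statistics import mean


def cell_means(records, keys):
    """Mean score for each combination of the given fields."""
    groups = defaultdict(list)
    for row in records:
        groups[tuple(row[k] for k in keys)].append(row["score"])
    return {cell: mean(scores) for cell, scores in groups.items()}


def flag_cells(records, keys, tolerance):
    """Cells whose mean is further than `tolerance` from the grand mean."""
    grand = mean(row["score"] for row in records)
    return {
        cell: m
        for cell, m in cell_means(records, keys).items()
        if abs(m - grand) > tolerance
    }


# Made-up scores: two per location and department, with one cell set higher.
raw = {
    ("A", "A"): (58, 60), ("A", "B"): (57, 59),
    ("B", "A"): (56, 58), ("B", "B"): (59, 57),
    ("C", "A"): (70, 72), ("C", "B"): (58, 60),
}
records = [
    {"location": loc, "department": dep, "score": s}
    for (loc, dep), scores in raw.items()
    for s in scores
]

print(flag_cells(records, ("location", "department"), tolerance=5))
```

The tolerance is a judgement call, just like reading the graph: the point is only to list the cells that sit far away from the grand mean.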

The next question is: what is the possible error? The building itself? The rubrics? The examiners? Or the SPs? We need further evidence.

Now, we will try judging the scores by visually reviewing the graph. After analyzing many data sets, the writer has found plotting the data as boxplots to be very helpful. As we can see for Department B, the score distribution is good in all locations, and the pattern in each station is more or less similar across locations. Department A in Building C, however, looks a little different compared to the other 2 locations, whereas in Buildings A and B the students' performance looks similar in both departments. From this, I would guess that the rubric is good, since it produces stable output. If we then compare the two departments in Building C, the visual pattern of the score range should not differ that much. My best guess is that stations 1 and 5 in Department A are the problematic stations.
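If you prefer numbers next to the picture, the values a boxplot draws (minimum, quartiles, maximum) can be computed per station and location. A minimal sketch with made-up scores follows; the field names are my assumption, and different software may compute quartiles with slightly different rules, so small differences from your plot are normal.

```python
from collections import defaultdict
from statistics import quantiles


def five_number_summary(scores):
    """Return (min, Q1, median, Q3, max), the values a boxplot draws."""
    q1, median, q3 = quantiles(scores, n=4)
    return (min(scores), q1, median, q3, max(scores))


def summaries_by_group(records, keys):
    """Five-number summary for each combination of the given fields."""
    groups = defaultdict(list)
    for row in records:
        groups[tuple(row[k] for k in keys)].append(row["score"])
    return {group: five_number_summary(scores) for group, scores in groups.items()}


# Made-up scores for the same station in two different buildings.
records = [
    {"location": "A", "station": 1, "score": s} for s in (50, 55, 60, 65, 70)
] + [
    {"location": "C", "station": 1, "score": s} for s in (60, 75, 80, 85, 95)
]

for group, summary in summaries_by_group(records, ("location", "station")).items():
    print(group, summary)
```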

Note: this visual judgement has not been validated or tested yet, but in the writer's experience it is very helpful.

All in all, based on the facts above, the findings suggest that the source of error is in Department A in Building C. Since the rubrics give good results in the other 2 buildings, the rubrics are not the problem. With trained professional actors as SPs, the SPs are most likely not the source of error. The most probable cause is the examiners in Department A in Building C: the way they judge the students somehow follows a different standard compared to the other locations. In this situation, retraining the examiners is the best solution in order to standardize the marking of student performance.


I hope this article helps you, readers, to understand more about Generalizability Theory, and especially to perform the analysis in an OSCE. The more data we analyze, the easier it becomes to interpret GT results. I will not say easy, but easier.

Enjoy playing around with EduG.
Please feel free to leave feedback in the comment section.