Assignment 3: CJ Project - RQ1 Analysis
14 November 2025
Student vs Expert Agreement on Explanation Quality
1. Introduction
1.1 Research Question
This report addresses Research Question 1: are there differences between students' and experts' opinions of explanation quality? The analysis recreates key features of Evans et al. (2022), comparing how two groups of raters (students and experts) judge mathematical explanations on the basis of comparative judgement data.
1.2 Background
Comparative judgement (CJ) is a process in which a group of judges makes a series of pairwise decisions between items, from which a reliable quality scale can be constructed without the need for rubrics. Examining how far students and experts share a perception of explanation quality is relevant to educational assessment and feedback practice.
1.3 Data Sources
Student judgements: workshop-groupA-judgements.csv and workshop-groupB-judgements.csv
Expert judgements: expert-judgements-Partial/decisions-clean.csv
Analysis focuses on explanations judged by both groups (n = 17 common explanations)
2. Methods
2.1 Data Preparation
The student judgement data are read from two CSV files, one per workshop group, and combined into a single dataset. A judge_type column is added to label these rows as student judgements; the expert judgements are imported and labelled in the same way. For consistency, the winner and loser variables, which record the outcome of each pairwise judgement, are converted to character type in both datasets. This avoids mismatches caused by mixed data formats, after which the explanations judged by both students and experts are identified by intersecting the unique explanation identifiers in the two datasets.
# Packages used throughout the analysis
library(dplyr)
library(PlackettLuce)
library(knitr)
library(kableExtra)
library(ggplot2)
# Load and combine student judgements
groupA_judgements <- read.csv("C:/Users/f/Desktop/data/workshop-groupA-judgements.csv")
groupB_judgements <- read.csv("C:/Users/f/Desktop/data/workshop-groupB-judgements.csv")
student_judgements <- bind_rows(groupA_judgements, groupB_judgements) %>%
mutate(judge_type = "student")
# Load expert judgements
expert_judgements <- read.csv("C:/Users/f/Desktop/data/expert-judgements-PARTIAL/decisions-clean.csv") %>%
mutate(judge_type = "expert")
# Convert to character and find common explanations
student_judgements <- student_judgements %>%
mutate(winner = as.character(winner), loser = as.character(loser))
expert_judgements <- expert_judgements %>%
mutate(winner = as.character(winner), loser = as.character(loser))
common_explanations <- intersect(
unique(c(student_judgements$winner, student_judgements$loser)),
unique(c(expert_judgements$winner, expert_judgements$loser))
)
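As a quick check on how much data falls outside this overlap, the two item sets can also be compared directly. The snippet below is a minimal sketch (not part of the original script) using only base R:
# Sanity check (sketch): how many explanations were judged by only one group?
student_items <- unique(c(student_judgements$winner, student_judgements$loser))
expert_items <- unique(c(expert_judgements$winner, expert_judgements$loser))
length(setdiff(student_items, expert_items)) # judged only by students
length(setdiff(expert_items, student_items)) # judged only by experts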
2.2 Analytical Approach
The judgement data are modelled within the Plackett-Luce framework. The data are first filtered to keep only pairwise comparisons involving the common explanations identified above, so that the student and expert analyses are fitted to the same set of items and remain directly comparable.
The filtered data are then converted into ranking objects with as.rankings(), which turns the winner-loser format into the ranking structure required for probabilistic modelling. Two separate Plackett-Luce models are fitted to these ranking objects, one for the student data and one for the expert data. In each model, the estimated worth parameter for an explanation measures its latent quality, or perceived merit, based on how often it is chosen over other explanations across the pairwise comparisons.
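Under the Plackett-Luce model, the probability that explanation i is preferred to explanation j in a single comparison is worth_i / (worth_i + worth_j). As a small illustration (using the two top student worths later reported in Table 2, not additional analysis):
# Illustration: implied probability that explanation 402 beats explanation 415
# under the student model, using the worths reported in Table 2
w_402 <- 0.2430
w_415 <- 0.2353
w_402 / (w_402 + w_415) # approximately 0.51, i.e. close to a toss-up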
The coefficients of the fitted models are extracted as quality scores for each explanation; comparing these scores across groups provides quantitative evidence of how far students and experts agree in their judgements of explanation quality. The script combines the extracted coefficients into a single dataframe containing, for each explanation, the student quality score, the expert quality score, their difference, and the absolute difference.
Finally, the relationship between the student and expert rankings is evaluated with Spearman's rank correlation coefficient, a non-parametric measure of the monotonic association between two sets of rankings. The associated significance test (cor.test) indicates whether the observed correlation is statistically significant, providing empirical grounds for concluding agreement or disagreement between the groups.
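For reference, when there are no tied scores Spearman's ρ is simply the Pearson correlation of the within-group ranks, which reduces to ρ = 1 − 6Σdᵢ² / (n(n² − 1)), where dᵢ is the difference between the student rank and the expert rank of explanation i and n is the number of common explanations; cor.test() reports this coefficient together with its p-value.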
- Data filtering: retained only explanations judged by both students and experts
- Model fitting: separate Plackett-Luce models for student and expert judgements
- Quality estimation: extracted worth values (coef() with log = FALSE) representing perceived quality
- Comparison: Spearman correlation between student and expert quality rankings
- Disagreement analysis: identified explanations with the largest rating differences
# Filter to common explanations and prepare for PlackettLuce
student_common <- student_judgements %>%
filter(winner %in% common_explanations & loser %in% common_explanations) %>%
select(judge, winner, loser)
expert_common <- expert_judgements %>%
filter(winner %in% common_explanations & loser %in% common_explanations) %>%
select(judge, winner, loser)
# Fit PlackettLuce models
student_rankings <- as.rankings(student_common[, c("winner", "loser")],
input = "orderings", id = student_common$judge)
expert_rankings <- as.rankings(expert_common[, c("winner", "loser")],
input = "orderings", id = expert_common$judge)
model_students <- PlackettLuce(student_rankings)
model_experts <- PlackettLuce(expert_rankings)
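# Optional diagnostic (sketch, not part of the original analysis): worth estimates
# are only fully identified when the comparison network is connected, i.e. every
# explanation is linked to every other through some chain of judgements. This
# assumes the adjacency() and connectivity() helpers exported by PlackettLuce.
connectivity(adjacency(student_rankings))
connectivity(adjacency(expert_rankings))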
# Extract quality scores
student_qualities <- coef(model_students, log = FALSE)
expert_qualities <- coef(model_experts, log = FALSE)
# Create comparison dataframe
common_items <- intersect(names(student_qualities), names(expert_qualities))
comparison_df <- data.frame(
explanation_id = common_items,
student_quality = student_qualities[common_items],
expert_quality = expert_qualities[common_items]
) %>% mutate(
quality_difference = student_quality - expert_quality,
abs_difference = abs(quality_difference)
)
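# Optional extension (sketch, not in the original script): carry explicit
# within-group ranks (1 = highest-rated) alongside the raw worths, which makes
# the rank-based comparison and the disagreement tables easier to read
comparison_df <- comparison_df %>%
  mutate(student_rank = rank(-student_quality),
         expert_rank = rank(-expert_quality),
         rank_difference = student_rank - expert_rank)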
# Calculate correlation
correlation <- cor(comparison_df$student_quality, comparison_df$expert_quality,
method = "spearman", use = "complete.obs")
cor_test <- cor.test(comparison_df$student_quality, comparison_df$expert_quality,
method = "spearman")
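With only 17 overlapping explanations, a single rank-based coefficient can be sensitive to a handful of items. A simple robustness check, not part of the original analysis, is to recompute the association with Kendall's τ using the same base R function:
# Optional robustness check: Kendall's tau on the same quality scores
cor.test(comparison_df$student_quality, comparison_df$expert_quality,
         method = "kendall")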
3. Results
3.1 Data Overview
This section reports descriptive statistics for the datasets under analysis: the number of common explanations, the total number of judgements made by each group, and the number of unique judges in each sample. These figures are summarised with kableExtra in a formatted table, making the scale of the data and the structure of the samples easy to communicate.
Table 1: Dataset Summary
total_student_judgements <- nrow(student_common)
total_expert_judgements <- nrow(expert_common)
unique_student_judges <- length(unique(student_common$judge))
unique_expert_judges <- length(unique(expert_common$judge))
data_summary <- data.frame(
Metric = c("Common explanations", "Student judgements", "Expert judgements",
"Student judges", "Expert judges"),
Count = c(length(common_explanations), total_student_judgements,
total_expert_judgements, unique_student_judges, unique_expert_judges)
)
kable(data_summary, format = "html", booktabs = TRUE, caption = "Dataset Summary") %>%
kable_styling(latex_options = "hold_position")
| Metric | Count |
|---|---|
| Common explanations | 17 |
| Student judgements | 83 |
| Expert judgements | 41 |
| Student judges | 16 |
| Expert judges | 5 |
3.2 Quality Rankings Comparison
This subsection compares the highest-rated explanations according to each group. The script identifies the top ten explanations by the quality scores produced by the student and expert Plackett-Luce models, and the results are tabulated by rank with each group's scores. This allows a direct comparison of what each group judged to be of the highest quality and gives a qualitative view of where the groups' evaluative patterns coincide or diverge.
Table 2: Top 10 Explanations by Student and Expert Ratings
top_student <- head(sort(student_qualities, decreasing = TRUE), 10)
top_expert <- head(sort(expert_qualities, decreasing = TRUE), 10)
top_comparison <- data.frame(
Rank = 1:10,
Student_Top = names(top_student),
Student_Score = round(top_student, 4),
Expert_Top = names(top_expert),
Expert_Score = round(top_expert, 4)
)
kable(top_comparison, format = "html", booktabs = TRUE,
caption = "Top 10 Explanations by Group") %>%
kable_styling(latex_options = "hold_position")
| Rank | Student_Top | Student_Score | Expert_Top | Expert_Score |
|---|---|---|---|---|
| 1 | 402 | 0.2430 | 410 | 0.3030 |
| 2 | 415 | 0.2353 | 416 | 0.0869 |
| 3 | 401 | 0.2053 | 406 | 0.0811 |
| 4 | 405 | 0.0587 | 418 | 0.0772 |
| 5 | 399 | 0.0575 | 415 | 0.0745 |
| 6 | 426 | 0.0489 | 419 | 0.0741 |
| 7 | 410 | 0.0454 | 402 | 0.0674 |
| 8 | 414 | 0.0436 | 401 | 0.0536 |
| 9 | 423 | 0.0148 | 405 | 0.0535 |
| 10 | 406 | 0.0148 | 426 | 0.0279 |
3.3 Agreement Analysis
This section examines the correspondence between student and expert quality estimates using a scatterplot built with ggplot2. Each point represents one explanation, positioned by its student-assigned quality score on the x-axis and its expert-assigned quality score on the y-axis. A regression line (with confidence interval) shows the overall trend of agreement, and a dashed diagonal line (y = x) marks where the two sets of scores would be in perfect agreement.
The correlation between student and expert quality ratings was 0.123 (Spearman's ρ, p = 0.639).
ggplot(comparison_df, aes(x = student_quality, y = expert_quality)) +
geom_point(size = 3, alpha = 0.7, color = "#2E86AB") +
geom_smooth(method = "lm", se = TRUE, color = "#A23B72", fill = "#F18F01") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#333333") +
labs(x = "Student-assigned Quality Score",
y = "Expert-assigned Quality Score") +
theme_minimal() +
theme(panel.grid.minor = element_blank())
3.4 Key Disagreements
The final analysis step identifies the explanations on which student and expert judgements differ most. The script ranks explanations by the absolute difference between their student and expert quality scores and isolates the ten most divergent cases. The results are displayed in a table giving, for each explanation, the student score, the expert score, the raw difference, and the absolute difference.
This step is particularly useful for diagnostic interpretation, because it pinpoints the explanations that account for a disproportionate share of the disagreement.
Table 3: Top 10 Explanations with Largest Student-Expert Disagreement
top_disagreements <- comparison_df %>%
arrange(desc(abs_difference)) %>%
head(10) %>%
mutate(across(c(student_quality, expert_quality, quality_difference, abs_difference),
~round(., 4)))
kable(top_disagreements, format = "html", booktabs = TRUE,
caption = "Explanations with Largest Rating Differences") %>%
kable_styling(latex_options = "hold_position")
| explanation_id | student_quality | expert_quality | quality_difference | abs_difference |
|---|---|---|---|---|
| 410 | 0.0454 | 0.3030 | -0.2577 | 0.2577 |
| 402 | 0.2430 | 0.0674 | 0.1756 | 0.1756 |
| 415 | 0.2353 | 0.0745 | 0.1607 | 0.1607 |
| 401 | 0.2053 | 0.0536 | 0.1517 | 0.1517 |
| 416 | 0.0023 | 0.0869 | -0.0846 | 0.0846 |
| 418 | 0.0030 | 0.0772 | -0.0743 | 0.0743 |
| 419 | 0.0057 | 0.0741 | -0.0684 | 0.0684 |
| 406 | 0.0148 | 0.0811 | -0.0663 | 0.0663 |
| 399 | 0.0575 | 0.0032 | 0.0543 | 0.0543 |
| 414 | 0.0436 | 0.0203 | 0.0233 | 0.0233 |
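To see which side each divergent explanation falls on (taken up in Section 4.2), the same dataframe can be split by the sign of the quality difference. A minimal sketch, not part of the original script:
# Explanations rated higher by students (positive differences)
comparison_df %>% filter(quality_difference > 0) %>% arrange(desc(quality_difference)) %>% head(3)
# Explanations rated higher by experts (negative differences)
comparison_df %>% filter(quality_difference < 0) %>% arrange(quality_difference) %>% head(3)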
4. Discussion
4.1 Level of Agreement
The observed correlation of 0.123 (Spearman's ρ) suggests weak agreement between student and expert judgments of explanation quality. This finding is not statistically significant (p = 0.639), indicating that there may be fundamental differences in how students and experts evaluate explanations.
4.2 Patterns in Disagreements
The largest disagreements occurred for explanations where:
- Students rated explanations much higher than experts (positive differences)
- Experts rated explanations much higher than students (negative differences)
These discrepancies may reflect different valuation of explanation features:
- Students may prioritize clarity, simplicity, or familiarity
- Experts may value mathematical rigor, completeness, or technical accuracy
4.3 Comparison with Evans et al. (2022)
Our findings contrast with Evans et al.'s observation that students and experts generally agree on explanation quality. The weak correlation suggests different perspectives on what constitutes high-quality mathematical explanations.
5. Limitations
- Partial overlap: only 17 explanations were judged by both groups, limiting comparability
- Context differences: student and expert judgements were collected in different contexts
- Sample size: the small number of common items may affect the reliability of the correlation
- Model assumptions: the Plackett-Luce model assumes transitive preferences, which may not always hold
6. Conclusion
This analysis reveals limited agreement between student and expert judgments of mathematical explanation quality. The correlation of 0.123 suggests that students and experts employ different criteria when evaluating explanations.
Educational implications:
- Caution is needed when using student peer assessment for explanation quality
- Different perspectives highlight the need for explicit quality criteria
Future research should investigate the specific features that students and experts value in mathematical explanations to better understand the sources of both agreement and disagreement.


