Assignment 3: CJ Project - RQ1 Analysis
14 November 2025
Student vs Expert Agreement on Explanation Quality
1. Introduction
1.1 Research Question
This report addresses Research Question 1: are there differences between students' and experts' opinions of explanation quality? The analysis recreates key features of Evans et al. (2022), comparing how two groups of raters (students and experts) judge mathematical explanations on the basis of comparative judgement data.
1.2 Background
Comparative judgement (CJ) is a process in which a group of judges makes a series of pairwise decisions between items, from which a reliable quality scale can be constructed without the need for rubrics. Examining how far students and experts share a perception of explanation quality is relevant to educational assessment and feedback practice.
1.3 Data Sources
Student judgements: workshop-groupA-judgements.csv and workshop-groupB-judgements.csv
Expert judgements: expert-judgements-Partial/decisions-clean.csv
Analysis focuses on explanations judged by both groups (n = 17 common explanations)
2. Methods
2.1 Data Preparation
The student judgement data are read from two CSV files, one per workshop group, and combined into a single dataset. A judge_type column is added to label these rows as student judgements; the expert judgements are imported and labelled in the same way. For consistency, the winner and loser variables, which record the outcome of each pairwise judgement, are converted to character type in both datasets. This avoids mismatches caused by mixed data formats, after which the explanations judged by both students and experts are identified by intersecting the unique explanation identifiers in the two datasets.
# Packages used throughout the analysis
library(dplyr)
library(PlackettLuce)
library(knitr)
library(kableExtra)
library(ggplot2)
# Load and combine student judgements
groupA_judgements <- read.csv("C:/Users/f/Desktop/data/workshop-groupA-judgements.csv")
groupB_judgements <- read.csv("C:/Users/f/Desktop/data/workshop-groupB-judgements.csv")
student_judgements <- bind_rows(groupA_judgements, groupB_judgements) %>%
mutate(judge_type = "student")
# Load expert judgements
expert_judgements <- read.csv("C:/Users/f/Desktop/data/expert-judgements-PARTIAL/decisions-clean.csv") %>%
mutate(judge_type = "expert")
# Convert to character and find common explanations
student_judgements <- student_judgements %>%
mutate(winner = as.character(winner), loser = as.character(loser))
expert_judgements <- expert_judgements %>%
mutate(winner = as.character(winner), loser = as.character(loser))
common_explanations <- intersect(
unique(c(student_judgements$winner, student_judgements$loser)),
unique(c(expert_judgements$winner, expert_judgements$loser))
)
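As a quick check on how much data falls outside this overlap, the two item sets can also be compared directly. The snippet below is a minimal sketch (not part of the original script) using only base R:
# Sanity check (sketch): how many explanations were judged by only one group?
student_items <- unique(c(student_judgements$winner, student_judgements$loser))
expert_items <- unique(c(expert_judgements$winner, expert_judgements$loser))
length(setdiff(student_items, expert_items)) # judged only by students
length(setdiff(expert_items, student_items)) # judged only by experts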
2.2 Analytical Approach
The judgement data are modelled within the Plackett-Luce framework. The data are first filtered to keep only pairwise comparisons involving the common explanations identified above, so that the student and expert analyses are fitted to the same set of items and remain directly comparable.
The filtered data are then converted into ranking objects with as.rankings(), which turns the winner-loser format into the ranking structure required for probabilistic modelling. Two separate Plackett-Luce models are fitted to these ranking objects, one for the student data and one for the expert data. In each model, the estimated worth parameter for an explanation measures its latent quality, or perceived merit, based on how often it is chosen over other explanations across the pairwise comparisons.
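Under the Plackett-Luce model, the probability that explanation i is preferred to explanation j in a single comparison is worth_i / (worth_i + worth_j). As a small illustration (using the two top student worths later reported in Table 2, not additional analysis):
# Illustration: implied probability that explanation 402 beats explanation 415
# under the student model, using the worths reported in Table 2
w_402 <- 0.2430
w_415 <- 0.2353
w_402 / (w_402 + w_415) # approximately 0.51, i.e. close to a toss-up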
The coefficients of the fitted models are extracted as quality scores for each explanation; comparing these scores across groups provides quantitative evidence of how far students and experts agree in their judgements of explanation quality. The script combines the extracted coefficients into a single dataframe containing, for each explanation, the student quality score, the expert quality score, their difference, and the absolute difference.
Finally, the relationship between the student and expert rankings is evaluated with Spearman's rank correlation coefficient, a non-parametric measure of the monotonic association between two sets of rankings. The associated significance test (cor.test) indicates whether the observed correlation is statistically significant, providing empirical grounds for concluding agreement or disagreement between the groups.
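For reference, when there are no tied scores Spearman's ρ is simply the Pearson correlation of the within-group ranks, which reduces to ρ = 1 − 6Σdᵢ² / (n(n² − 1)), where dᵢ is the difference between the student rank and the expert rank of explanation i and n is the number of common explanations; cor.test() reports this coefficient together with its p-value.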
- Data filtering: retained only explanations judged by both students and experts
- Model fitting: separate Plackett-Luce models for student and expert judgements
- Quality estimation: extracted worth values (coef() with log = FALSE) representing perceived quality
- Comparison: Spearman correlation between student and expert quality rankings
- Disagreement analysis: identified explanations with the largest rating differences
# Filter to common explanations and prepare for PlackettLuce
student_common <- student_judgements %>%
filter(winner %in% common_explanations & loser %in% common_explanations) %>%
select(judge, winner, loser)
expert_common <- expert_judgements %>%
filter(winner %in% common_explanations & loser %in% common_explanations) %>%
select(judge, winner, loser)
# Fit PlackettLuce models
student_rankings <- as.rankings(student_common[, c("winner", "loser")],
input = "orderings", id = student_common$judge)
expert_rankings <- as.rankings(expert_common[, c("winner", "loser")],
input = "orderings", id = expert_common$judge)
model_students <- PlackettLuce(student_rankings)
model_experts <- PlackettLuce(expert_rankings)
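# Optional diagnostic (sketch, not part of the original analysis): worth estimates
# are only fully identified when the comparison network is connected, i.e. every
# explanation is linked to every other through some chain of judgements. This
# assumes the adjacency() and connectivity() helpers exported by PlackettLuce.
connectivity(adjacency(student_rankings))
connectivity(adjacency(expert_rankings))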
# Extract quality scores
student_qualities <- coef(model_students, log = FALSE)
expert_qualities <- coef(model_experts, log = FALSE)
# Create comparison dataframe
common_items <- intersect(names(student_qualities), names(expert_qualities))
comparison_df <- data.frame(
explanation_id = common_items,
student_quality = student_qualities[common_items],
expert_quality = expert_qualities[common_items]
) %>% mutate(
quality_difference = student_quality - expert_quality,
abs_difference = abs(quality_difference)
)
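# Optional extension (sketch, not in the original script): carry explicit
# within-group ranks (1 = highest-rated) alongside the raw worths, which makes
# the rank-based comparison and the disagreement tables easier to read
comparison_df <- comparison_df %>%
  mutate(student_rank = rank(-student_quality),
         expert_rank = rank(-expert_quality),
         rank_difference = student_rank - expert_rank)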
# Calculate correlation
correlation <- cor(comparison_df$student_quality, comparison_df$expert_quality,
method = "spearman", use = "complete.obs")
cor_test <- cor.test(comparison_df$student_quality, comparison_df$expert_quality,
method = "spearman")
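With only 17 overlapping explanations, a single rank-based coefficient can be sensitive to a handful of items. A simple robustness check, not part of the original analysis, is to recompute the association with Kendall's τ using the same base R function:
# Optional robustness check: Kendall's tau on the same quality scores
cor.test(comparison_df$student_quality, comparison_df$expert_quality,
         method = "kendall")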
3. Results
3.1 Data Overview
This section reports descriptive statistics for the datasets under analysis: the number of common explanations, the total number of judgements made by each group, and the number of unique judges in each sample. These figures are summarised with kableExtra in a formatted table, making the scale of the data and the structure of the samples easy to communicate.
Table 1: Dataset Summary
total_student_judgements <- nrow(student_common)
total_expert_judgements <- nrow(expert_common)
unique_student_judges <- length(unique(student_common$judge))
unique_expert_judges <- length(unique(expert_common$judge))
data_summary <- data.frame(
Metric = c("Common explanations", "Student judgements", "Expert judgements",
"Student judges", "Expert judges"),
Count = c(length(common_explanations), total_student_judgements,
total_expert_judgements, unique_student_judges, unique_expert_judges)
)
kable(data_summary, format = "html", booktabs = TRUE, caption = "Dataset Summary") %>%
kable_styling(latex_options = "hold_position")
| Metric | Count |
|---|---|
| Common explanations | 17 |
| Student judgements | 83 |
| Expert judgements | 41 |
| Student judges | 16 |
| Expert judges | 5 |
3.2 Quality Rankings Comparison
This subsection compares the highest-rated explanations according to each group. The script identifies the top ten explanations by the quality scores produced by the student and expert Plackett-Luce models, and the results are tabulated by rank with each group's scores. This allows a direct comparison of what each group judged to be of the highest quality and gives a qualitative view of where the groups' evaluative patterns coincide or diverge.
Table 2: Top 10 Explanations by Student and Expert Ratings
top_student <- head(sort(student_qualities, decreasing = TRUE), 10)
top_expert <- head(sort(expert_qualities, decreasing = TRUE), 10)
top_comparison <- data.frame(
Rank = 1:10,
Student_Top = names(top_student),
Student_Score = round(top_student, 4),
Expert_Top = names(top_expert),
Expert_Score = round(top_expert, 4)
)
kable(top_comparison, format = "html", booktabs = TRUE,
caption = "Top 10 Explanations by Group") %>%
kable_styling(latex_options = "hold_position")
| Rank | Student_Top | Student_Score | Expert_Top | Expert_Score |
|---|---|---|---|---|
| 1 | 402 | 0.2430 | 410 | 0.3030 |
| 2 | 415 | 0.2353 | 416 | 0.0869 |
| 3 | 401 | 0.2053 | 406 | 0.0811 |
| 4 | 405 | 0.0587 | 418 | 0.0772 |
| 5 | 399 | 0.0575 | 415 | 0.0745 |
| 6 | 426 | 0.0489 | 419 | 0.0741 |
| 7 | 410 | 0.0454 | 402 | 0.0674 |
| 8 | 414 | 0.0436 | 401 | 0.0536 |
| 9 | 423 | 0.0148 | 405 | 0.0535 |
| 10 | 406 | 0.0148 | 426 | 0.0279 |
3.3 Agreement Analysis
This section examines the correspondence between student and expert quality estimates using a scatterplot built with ggplot2. Each point represents one explanation, positioned by its student-assigned quality score on the x-axis and its expert-assigned quality score on the y-axis. A regression line (with confidence interval) shows the overall trend of agreement, and a dashed diagonal line (y = x) marks where the two sets of scores would be in perfect agreement.
The correlation between student and expert quality ratings was 0.123 (Spearman's ρ, p = 0.639).
ggplot(comparison_df, aes(x = student_quality, y = expert_quality)) +
geom_point(size = 3, alpha = 0.7, color = "#2E86AB") +
geom_smooth(method = "lm", se = TRUE, color = "#A23B72", fill = "#F18F01") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#333333") +
labs(x = "Student-assigned Quality Score",
y = "Expert-assigned Quality Score") +
theme_minimal() +
theme(panel.grid.minor = element_blank())
3.4 Key Disagreements
The final analysis step identifies the explanations on which student and expert judgements differ most. The script ranks explanations by the absolute difference between their student and expert quality scores and isolates the ten most divergent cases. The results are displayed in a table giving, for each explanation, the student score, the expert score, the raw difference, and the absolute difference.
This step is particularly useful for diagnostic interpretation, because it pinpoints the explanations that account for a disproportionate share of the disagreement.
Table 3: Top 10 Explanations with Largest Student-Expert Disagreement
top_disagreements <- comparison_df %>%
arrange(desc(abs_difference)) %>%
head(10) %>%
mutate(across(c(student_quality, expert_quality, quality_difference, abs_difference),
~round(., 4)))
kable(top_disagreements, format = "html", booktabs = TRUE,
caption = "Explanations with Largest Rating Differences") %>%
kable_styling(latex_options = "hold_position")
| explanation_id | student_quality | expert_quality | quality_difference | abs_difference |
|---|---|---|---|---|
| 410 | 0.0454 | 0.3030 | -0.2577 | 0.2577 |
| 402 | 0.2430 | 0.0674 | 0.1756 | 0.1756 |
| 415 | 0.2353 | 0.0745 | 0.1607 | 0.1607 |
| 401 | 0.2053 | 0.0536 | 0.1517 | 0.1517 |
| 416 | 0.0023 | 0.0869 | -0.0846 | 0.0846 |
| 418 | 0.0030 | 0.0772 | -0.0743 | 0.0743 |
| 419 | 0.0057 | 0.0741 | -0.0684 | 0.0684 |
| 406 | 0.0148 | 0.0811 | -0.0663 | 0.0663 |
| 399 | 0.0575 | 0.0032 | 0.0543 | 0.0543 |
| 414 | 0.0436 | 0.0203 | 0.0233 | 0.0233 |
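To see which side each divergent explanation falls on (taken up in Section 4.2), the same dataframe can be split by the sign of the quality difference. A minimal sketch, not part of the original script:
# Explanations rated higher by students (positive differences)
comparison_df %>% filter(quality_difference > 0) %>% arrange(desc(quality_difference)) %>% head(3)
# Explanations rated higher by experts (negative differences)
comparison_df %>% filter(quality_difference < 0) %>% arrange(quality_difference) %>% head(3)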
4. Discussion
4.1 Level of Agreement
The observed correlation of 0.123 (Spearman's ρ) suggests weak agreement between student and expert judgments of explanation quality. This finding is not statistically significant (p = 0.639), indicating that there may be fundamental differences in how students and experts evaluate explanations.
4.2 Patterns in Disagreements
The largest disagreements occurred for explanations where:
- Students rated explanations much higher than experts (positive differences)
- Experts rated explanations much higher than students (negative differences)
These discrepancies may reflect different valuation of explanation features:
- Students may prioritize clarity, simplicity, or familiarity
- Experts may value mathematical rigor, completeness, or technical accuracy
4.3 Comparison with Evans et al. (2022)
Our findings contrast with Evans et al.'s observation that students and experts generally agree on explanation quality. The weak correlation suggests different perspectives on what constitutes high-quality mathematical explanations.
5. Limitations
- Partial overlap: only 17 explanations were judged by both groups, limiting comparability
- Context differences: student and expert judgements were collected in different contexts
- Sample size: the small number of common items may affect the reliability of the correlation
- Model assumptions: the Plackett-Luce model assumes transitive preferences, which may not always hold
6. Conclusion
This analysis reveals limited agreement between student and expert judgments of mathematical explanation quality. The correlation of 0.123 suggests that students and experts employ different criteria when evaluating explanations.
Educational implications:
- Caution is needed when using student peer assessment for explanation quality
- Different perspectives highlight the need for explicit quality criteria
Future research should investigate the specific features that students and experts value in mathematical explanations to better understand the sources of both agreement and disagreement.


