Recently, I was involved in some annotation processes involving two coders, and I needed to compute inter-rater reliability scores. There are multiple measures for calculating agreement between two or more coders/annotators.

If you are wondering which measure to use in your case, I would suggest reading Hayes & Krippendorff (2007), which compares different measures and provides suggestions on which to use when.

In this post, I am sharing some of our Python code for calculating various inter-rater reliability measures.

## Cohen’s Kappa

We will start with Cohen’s kappa. Let’s say we have two coders who have coded a particular phenomenon, each assigning a code to the same 10 instances. Now let’s write the Python code to compute Cohen’s kappa.

You can use either `sklearn.metrics` or `nltk.agreement` to compute kappa. We will see examples using both of these packages.

```python
from sklearn.metrics import cohen_kappa_score

coder1 = [1,0,2,0,1,1,2,0,1,1]
coder2 = [1,1,0,0,1,1,2,1,1,0]
score = cohen_kappa_score(coder1, coder2)

print('Cohen\'s Kappa:', score)
```

```
Cohen's Kappa: 0.3220338983050848
```
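To see where this number comes from, we can also compute kappa by hand from its definition, kappa = (po − pe)/(1 − pe), where po is the observed agreement and pe is the agreement expected by chance from each coder’s marginal label distribution. A minimal sketch in plain Python:

```python
from collections import Counter

coder1 = [1,0,2,0,1,1,2,0,1,1]
coder2 = [1,1,0,0,1,1,2,1,1,0]
n = len(coder1)

# observed agreement: fraction of instances where the coders agree
po = sum(a == b for a, b in zip(coder1, coder2)) / n

# chance agreement: product of the coders' marginal label proportions
c1, c2 = Counter(coder1), Counter(coder2)
pe = sum((c1[label] / n) * (c2[label] / n) for label in set(coder1) | set(coder2))

kappa = (po - pe) / (1 - pe)
print(kappa)  # matches cohen_kappa_score: 0.3220...
```

Here po = 0.6 and pe = 0.41, so kappa = 0.19/0.59 ≈ 0.322, the same value sklearn reports.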

In order to use the `nltk.agreement` package, we need to structure our coding data into the format `[coder, instance, code]`. For instance, the first code in coder1 is 1, which will be formatted as `[1,0,1]`, meaning coder 1 assigned the code `1` to instance 0 (instances are numbered from zero).

Let’s convert the codes from the above example into the `[coder,instance,code]` format. We have two options here. I have included the first option for better understanding; the second is a short one-line solution.

```python
coder1 = [1,0,2,0,1,1,2,0,1,1]
coder2 = [1,1,0,0,1,1,2,1,1,0]

coder1_new = []
coder2_new = []
for i in range(len(coder1)):
    coder1_new.append([1, i, coder1[i]])
    coder2_new.append([2, i, coder2[i]])

formatted_codes = coder1_new + coder2_new
print(formatted_codes)
```

```
[[1, 0, 1], [1, 1, 0], [1, 2, 2], [1, 3, 0], [1, 4, 1], [1, 5, 1], [1, 6, 2], [1, 7, 0], [1, 8, 1], [1, 9, 1], [2, 0, 1], [2, 1, 1], [2, 2, 0], [2, 3, 0], [2, 4, 1], [2, 5, 1], [2, 6, 2], [2, 7, 1], [2, 8, 1], [2, 9, 0]]
```
```python
formatted_codes = [[1,i,coder1[i]] for i in range(len(coder1))] + [[2,i,coder2[i]] for i in range(len(coder2))]
print(formatted_codes)
```

```
[[1, 0, 1], [1, 1, 0], [1, 2, 2], [1, 3, 0], [1, 4, 1], [1, 5, 1], [1, 6, 2], [1, 7, 0], [1, 8, 1], [1, 9, 1], [2, 0, 1], [2, 1, 1], [2, 2, 0], [2, 3, 0], [2, 4, 1], [2, 5, 1], [2, 6, 2], [2, 7, 1], [2, 8, 1], [2, 9, 0]]
```

Now that we have our codes in the required format, we can compute Cohen’s kappa using `nltk.agreement`.

```python
from nltk import agreement

ratingtask = agreement.AnnotationTask(data=formatted_codes)
print("Cohen's Kappa:", ratingtask.kappa())
```

```
Cohen's Kappa: 0.32203389830508466
```

#### Cohen’s Kappa using CSV files

In this section, we will see how to compute Cohen’s kappa from codes stored in CSV files. Let’s say we have two files (coder1.csv, coder2.csv), one per coder, where each column represents a dimension. Below is a snapshot of such a file.

The file contains 10 columns, each representing a dimension coded by the first coder. We have a similar file for coder2, and we want to calculate Cohen’s kappa for each of these dimensions.

```
SMU,CF,KE,ARG,STR,CO,u1,u2,u3,u4
1,1,1,0,0,1,2,1,1,0
1,1,1,0,0,2,1,2,1,1
2,2,1,-1,0,1,2,1,2,2
2,2,1,1,0,2,2,2,2,1
1,2,1,1,0,2,2,2,1,2
1,1,1,1,0,1,2,2,1,2
2,1,1,1,0,2,2,2,1,2
2,1,2,2,0,2,1,2,2,2
2,2,2,2,0,2,2,2,2,2
```

We will use the `pandas` Python package to load our CSV files and access the codes for each dimension.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# load each coder's file
coder1 = pd.read_csv('coder1.csv')
coder2 = pd.read_csv('coder2.csv')

dimensions = coder1.columns

# iterate over each dimension
for dim in dimensions:
    dim_codes1 = coder1[dim]
    dim_codes2 = coder2[dim]
    print('Dimension:', dim)

    score = cohen_kappa_score(dim_codes1, dim_codes2)
    print(' ', score)
```
```
Dimension: SMU
0.3076923076923077
Dimension: CF
0.55
Dimension: KE
0.12903225806451613
Dimension: ARG
0.6896551724137931
Dimension: STR
0.0
Dimension: CO
-0.19999999999999996
Dimension: u1
0.0
Dimension: u2
0.0
Dimension: u3
0.3414634146341463
Dimension: u4
0.4375
```

## Fleiss’s Kappa

As per my understanding, Cohen’s kappa can be used if you have codes from only two coders. If you have codes from more than two coders, you need to use Fleiss’s kappa.

We will use the `nltk.agreement` package for calculating Fleiss’s kappa, so we add one more coder’s data to our previous example.

```python
from nltk import agreement

coder1 = [1,0,2,0,1,1,2,0,1,1]
coder2 = [1,1,0,0,1,1,2,1,1,0]
coder3 = [1,2,2,1,2,1,2,1,1,0]

formatted_codes = [[1,i,coder1[i]] for i in range(len(coder1))] + [[2,i,coder2[i]] for i in range(len(coder2))] + [[3,i,coder3[i]] for i in range(len(coder3))]

ratingtask = agreement.AnnotationTask(data=formatted_codes)
print("Fleiss's Kappa:", ratingtask.multi_kappa())
```

```
Fleiss's Kappa: 0.3010752688172044
```
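As a side note, `nltk`’s `multi_kappa` implements Davies and Fleiss’s (1982) multi-rater generalization of kappa. Fleiss’s original (1971) statistic can also be computed directly from the per-instance category counts; a plain-Python sketch (its value coincides with the multi-pi that `nltk` reports as Scott’s pi for more than two coders, so it differs slightly from `multi_kappa`):

```python
from collections import Counter

coder1 = [1,0,2,0,1,1,2,0,1,1]
coder2 = [1,1,0,0,1,1,2,1,1,0]
coder3 = [1,2,2,1,2,1,2,1,1,0]

ratings = list(zip(coder1, coder2, coder3))
n = len(ratings[0])  # raters per instance (3)
N = len(ratings)     # number of instances (10)

# per-instance agreement: fraction of agreeing rater pairs
def P_i(item):
    counts = Counter(item)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

P_bar = sum(P_i(item) for item in ratings) / N

# chance agreement from the overall category proportions
all_codes = Counter(c for item in ratings for c in item)
total = n * N
P_e = sum((c / total) ** 2 for c in all_codes.values())

fleiss_kappa = (P_bar - P_e) / (1 - P_e)
print(fleiss_kappa)  # ~0.2857 for this data
```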

#### Fleiss’s Kappa using CSV files

Now, let’s say we have three CSV files, one from each coder. Each coder assigned codes on ten dimensions (as shown in the CSV example above). The following code computes Fleiss’s kappa among the three coders for each dimension.

```python
import pandas as pd
from nltk import agreement

coder1 = pd.read_csv('coder1.csv')
coder2 = pd.read_csv('coder2.csv')
coder3 = pd.read_csv('coder3.csv')

dimensions = coder1.columns

for dim in dimensions:
    dim_codes1 = coder1[dim]
    dim_codes2 = coder2[dim]
    dim_codes3 = coder3[dim]

    formatted_codes = [[1,i,dim_codes1[i]] for i in range(len(dim_codes1))] + [[2,i,dim_codes2[i]] for i in range(len(dim_codes2))] + [[3,i,dim_codes3[i]] for i in range(len(dim_codes3))]

    ratingtask = agreement.AnnotationTask(data=formatted_codes)
    print('Dimension:', dim, ratingtask.multi_kappa())
```

## Cronbach’s Alpha

Cronbach’s alpha is mostly used to measure the internal consistency of a survey or questionnaire. For this measure, I am using the `Pingouin` package.

Let’s say we have data from a questionnaire (with Likert-scale questions) in a CSV file. For this example, I am using a dataset from `Pingouin` that has some missing values.

```python
import pingouin as pg

# load the example dataset shipped with Pingouin (it contains missing values)
data = pg.read_dataset('cronbach_wide_missing')
```
```
Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Q11
1.0,1,1.0,1,1.0,1,1,1,1.0,1,1
1.0,1,1.0,1,1.0,1,1,1,0.0,1,0
,0,1.0,1,,1,1,1,1.0,0,0
1.0,1,1.0,0,1.0,1,0,1,1.0,0,0
1.0,1,1.0,1,1.0,0,0,0,1.0,0,0
0.0,1,,0,1.0,1,1,1,0.0,0,0
1.0,1,1.0,1,0.0,0,1,0,0.0,0,0
1.0,1,1.0,1,1.0,0,0,0,0.0,0,0
0.0,1,0.0,1,1.0,0,0,0,0.0,1,0
1.0,0,0.0,1,0.0,1,0,0,,0,0
1.0,1,1.0,0,0.0,0,0,0,0.0,0,0
1.0,0,0.0,1,0.0,0,0,0,0.0,0,0
```
```python
pg.cronbach_alpha(data=data)
```

```
(0.732661, array([0.435, 0.909]))
```
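The function returns the alpha value together with its 95% confidence interval. For intuition, alpha can also be computed directly from its definition, α = k/(k−1) · (1 − Σ var(itemᵢ)/var(total score)). A toy sketch with made-up scores for three items (perfectly consistent answers give α = 1):

```python
# toy example: 4 respondents x 3 items, answers perfectly consistent
scores = [
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3],
    [4, 4, 4],
]
k = len(scores[0])  # number of items

def var(xs):  # sample variance (ddof=1), as used in the alpha formula
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

item_vars = [var([row[j] for row in scores]) for j in range(k)]
total_var = var([sum(row) for row in scores])

alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
print(alpha)  # ≈ 1.0 for perfectly consistent items
```

Unlike this toy version, `pg.cronbach_alpha` also handles the missing values in the dataset above.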

## Krippendorff’s Alpha & Scott’s Pi

We can use the `nltk.agreement` Python package for both of these measures; the example below computes both.

For `nltk.agreement`, we need the data in the same `[coder, instance, code]` format as in the previous examples. Once we have the formatted data, we simply call the `alpha` function to get Krippendorff’s alpha (and `pi` for Scott’s pi). Let’s see the Python code.

```python
from nltk import agreement

coder1 = [1,0,2,0,1,1,2,0,1,1]
coder2 = [1,1,0,0,1,1,2,1,1,0]
coder3 = [1,2,2,1,2,1,2,1,1,0]

formatted_codes = [[1,i,coder1[i]] for i in range(len(coder1))] + [[2,i,coder2[i]] for i in range(len(coder2))] + [[3,i,coder3[i]] for i in range(len(coder3))]

ratingtask = agreement.AnnotationTask(data=formatted_codes)
print("Krippendorff's alpha:", ratingtask.alpha())
print("Scott's pi:", ratingtask.pi())
```

```
Krippendorff's alpha: 0.30952380952380953
Scott's pi: 0.2857142857142859
```
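As a sanity check, nominal Krippendorff’s alpha can be computed by hand as α = 1 − Do/De, comparing the observed fraction of disagreeing rater pairs within instances against the disagreement expected from the pooled label distribution. A plain-Python sketch for the same three coders:

```python
from collections import Counter
from itertools import combinations

coder1 = [1,0,2,0,1,1,2,0,1,1]
coder2 = [1,1,0,0,1,1,2,1,1,0]
coder3 = [1,2,2,1,2,1,2,1,1,0]

units = list(zip(coder1, coder2, coder3))

# observed disagreement: fraction of disagreeing rater pairs within units
pairs = [(a, b) for unit in units for a, b in combinations(unit, 2)]
D_o = sum(a != b for a, b in pairs) / len(pairs)

# expected disagreement from the pooled label counts (nominal metric)
counts = Counter(label for unit in units for label in unit)
total = sum(counts.values())
D_e = 1 - sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))

alpha = 1 - D_o / D_e
print(alpha)  # matches nltk's alpha: 0.3095...
```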

## Intraclass Correlation

I am using the `Pingouin` package mentioned before as well; the function used is `intraclass_corr`. This function returns a `pandas` DataFrame with the following information (taken from the R package `psych` documentation). Six cases are returned (ICC1, ICC2, ICC3, ICC1k, ICC2k, ICC3k), with the following meaning for each case.

Shrout and Fleiss (1979) consider six cases of reliability of ratings done by k raters on n targets.

ICC1: Each target is rated by a different judge and the judges are selected at random. (This is a one-way ANOVA fixed effects model and is found by (MSB- MSW)/(MSB+ (nr-1)*MSW))

ICC2: A random sample of k judges rate each target. The measure is one of absolute agreement in the ratings. Found as (MSB- MSE)/(MSB + (nr-1)*MSE + nr*(MSJ-MSE)/nc)

ICC3: A fixed set of k judges rate each target. There is no generalization to a larger population of judges. (MSB – MSE)/(MSB+ (nr-1)*MSE)

Then, for each of these cases, is reliability to be estimated for a single rating or for the average of k ratings? (The 1 rating case is equivalent to the average intercorrelation, the k rating case to the Spearman Brown adjusted reliability.)

ICC1 is sensitive to differences in means between raters and is a measure of absolute agreement.

ICC2 and ICC3 remove mean differences between judges, but are sensitive to interactions of raters by judges. The difference between ICC2 and ICC3 is whether raters are seen as fixed or random effects.

ICC1k, ICC2k, ICC3k reflect the means of k raters.
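The single-rating and average-rating cases are linked by the Spearman-Brown formula, ICCk = k·ICC / (1 + (k−1)·ICC). As a quick check against the rounded values in the example output below (ICC1 = 0.728, with k = 4 judges, inferred from the degrees of freedom):

```python
# Spearman-Brown step-up: reliability of the average of k ratings
def spearman_brown(icc_single, k):
    return k * icc_single / (1 + (k - 1) * icc_single)

# rounded ICC1 from the example output, 4 judges
print(spearman_brown(0.728, 4))  # ≈ 0.914, matching ICC1k up to rounding
```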

The following example uses the wine-rating dataset from `Pingouin`.

```python
import pingouin as pg

# Pingouin's example ICC dataset: wines rated by several judges
data = pg.read_dataset('icc')

icc = pg.intraclass_corr(data=data, targets='Wine', raters='Judge', ratings='Scores')
icc
```
```
    Type              Description    ICC       F  df1  df2      pval         CI95%
0   ICC1   Single raters absolute  0.728  11.680    7   24  0.000002  [0.43, 0.93]
1   ICC2     Single random raters  0.728  11.788    7   21  0.000005  [0.43, 0.93]
2   ICC3      Single fixed raters  0.730  11.788    7   21  0.000005  [0.43, 0.93]
3   ICC1k  Average raters absolute 0.914  11.680    7   24  0.000002  [0.75, 0.98]
4   ICC2k    Average random raters 0.914  11.788    7   21  0.000005  [0.75, 0.98]
5   ICC3k     Average fixed raters 0.915  11.788    7   21  0.000005  [0.75, 0.98]
```

## References

1. Hayes, A. F., & Krippendorff, K. (2007). Answering the Call for a Standard Reliability Measure for Coding Data. Communication Methods and Measures, 1(1), 77–89. https://doi.org/10.1080/19312450709336664