Division data is very dirty
Created by: ajvondrak
Thanks for gathering all this data! I got curious about doing some analysis on it (e.g., percentiles by various criteria). Don't know if you'll be able to use any of the code I'm writing, per se, but I hope I can give you some sort of input of value.
As of my build of the latest data, there are 2,551 distinct divisions in the lifters data. I figured that this was down to how ad hoc divisions are across federations (everybody gets a
- Many specify the sex of the lifter, which should ideally be given by the Sex column.
[alex@pc openpowerlifting]$ cut -d',' -f3 build/openpowerlifting.csv | sort -u | grep -i 'men' | wc -l
436
[alex@pc openpowerlifting]$ cut -d',' -f3 build/openpowerlifting.csv | sort -u | grep -i 'men' | head -10
13-15 Junior Men
13-15 Junior Women
13-15 Men
13-15 Teen Men
13-15 Women
148Submaster Women 35-39
16-17 Junior Men
16-17 Men
16-17 Teen Men
16-18 Junior Men
- Some contain the weight class, which should ideally be given by the WeightClassKg column.
[alex@pc openpowerlifting]$ cut -d',' -f3 build/openpowerlifting.csv | sort -u | grep -i '198'
198.25 DL 40-44
198.25 IM 35-39
198.25 IM 50-54
198.25 IM 65-69
198.25 RB 14-15
198.25 RB 20-23
198.25 RB 45-49
198.25 RB 50-54
198.25 RB OPEN
198.25 SB 40-44
Heavy group (181 198 ) Wilks formula
Heavywt group (181 198 )
Heavywt women-148-198
Heavywt women-148-198 by Wilks formula
Heavywt women-198
Lightwt group (181 198)
Lightwt group (198 220) by Wilks formula
Medium group (198 220 242) Wilks formula
Middlewt group (198 220) by Wilks formula
- Many specify raw vs equipped, which should ideally be given by the Equipment column.
[alex@pc openpowerlifting]$ cut -d',' -f3 build/openpowerlifting.csv | sort -u | grep '\bR' | tail -10
R-T3
R-T-3
R-THW
R-T&J
R-TJR
R-Var
R-Y
R-Y1
R-Y2
R-Y3
- There are many ways of spelling the same "core" divisions that are common across federations. Formatting of all kinds plays into this (parentheses, capitalization, spacing, punctuation, etc.).
[alex@pc openpowerlifting]$ cut -d',' -f3 build/openpowerlifting.csv | sort -u | grep -i 'sub.\?junior' | grep -v -i 'amateur'
Subjunior
SubJunior
Sub-Junior
Subjuniors
Sub-Juniors
- So on and so forth. Normalization being the classic problem of data engineering and all.
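As a rough illustration of the spelling problem above, here is a minimal sketch of a normalization key that collapses the five "Subjunior" variants into one. The function name `normalize_division` and the rules it applies are assumptions made for this example, not anything in the project. It is deliberately crude: dropping all punctuation also turns "198.25" into "19825", and the plural folding would mangle words that legitimately end in a single "s".

```python
import re


def normalize_division(raw: str) -> str:
    """Collapse formatting variants of a division name into one key.

    Hypothetical helper for illustration only.
    """
    key = raw.strip().lower()
    # Drop punctuation and whitespace: "Sub-Junior" -> "subjunior".
    key = re.sub(r"[^a-z0-9]+", "", key)
    # Fold a trailing plural "s" (but not "ss"): "subjuniors" -> "subjunior".
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


# The variants listed in the grep output above.
variants = ["Subjunior", "SubJunior", "Sub-Junior", "Subjuniors", "Sub-Juniors"]
keys = {normalize_division(v) for v in variants}
print(keys)  # {'subjunior'}
```

A key like this is probably best used only for grouping candidates to review by hand, since it throws away information that a display name would still need.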
Granted, there will still be some divisions that can't "cross over" between feds: people will put different age bounds on their various divisions (youth/teen/juniors/submasters/masters) that we couldn't pull strictly from the lifter's age, I spy some "Crossfit" divisions, yadda. But I think we can do a lot better than 2,551 distinct values.
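To get a feel for how much a single cleanup step might shrink that number, here is a hedged sketch that strips a standalone Men/Women token from each division and compares distinct counts before and after. It assumes the division is the third CSV field, as in the `cut -f3` commands above. The helper names `strip_sex` and `distinct_divisions` are made up for this example, and the sample rows reuse division strings from the output above with placeholder values in the other columns.

```python
import csv
import io
import re

# Word boundaries keep "Women" from also matching as "Men".
SEX_TOKEN = re.compile(r"\b(men|women)\b", re.IGNORECASE)


def strip_sex(division: str) -> str:
    """Remove a standalone Men/Women token and tidy leftover whitespace."""
    without = SEX_TOKEN.sub(" ", division)
    return " ".join(without.split())


def distinct_divisions(rows, column=2):
    """Return (raw_count, stripped_count) of distinct values in `column`."""
    raw = set()
    stripped = set()
    for row in rows:
        if len(row) <= column:
            continue  # skip short or malformed rows
        raw.add(row[column])
        stripped.add(strip_sex(row[column]))
    return len(raw), len(stripped)


# In-memory sample; columns 1 and 2 are placeholders, not real data.
sample = io.StringIO(
    "a,b,13-15 Junior Men\n"
    "a,b,13-15 Junior Women\n"
    "a,b,13-15 Men\n"
    "a,b,13-15 Women\n"
    "a,b,16-17 Teen Men\n"
)
print(distinct_divisions(csv.reader(sample)))  # (5, 3)
```

On the real file you would pass `csv.reader(open("build/openpowerlifting.csv", newline=""))` instead. Note that the header row would be counted as one more distinct value, and that `csv.reader` honors quoted commas whereas `cut` does not, so the totals may not match the shell numbers exactly.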