Chapter 3 Data transformation
3.1 Step 1: Change column names
The column names in the original dataset are long and contain whitespace, which makes them awkward to reference in later code. We therefore replaced them with short names.
The column names of the aggregated data frame after this transformation are:
## Rows: 1,831
## Columns: 11
## $ DBN <chr> NA, NA, "01M015", "01M019", "01M020", "01M034", "…
## $ name <chr> NA, NA, "P.S. 015 ROBERTO CLEMENTE", "P.S. 019 AS…
## $ parent_rr <dbl> NA, NA, 91, 100, 58, 29, 80, 52, 79, 46, 57, 100,…
## $ teacher_rr <dbl> NA, NA, 100, 93, 90, 100, 100, 96, 77, 93, 89, 10…
## $ student_rr <chr> NA, NA, "N/A", "N/A", "N/A", "95", "N/A", "N/A", …
## $ collab_teachers <chr> NA, NA, "4.0999999999999996", "4.53", "2.71", "2.…
## $ effective_schlead <chr> NA, NA, "4.1900000000000004", "4.51", "2.98", "2.…
## $ rig_instruction <chr> NA, NA, "4.0199999999999996", "4.8", "1.92", "2.1…
## $ supp_env <chr> NA, NA, "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",…
## $ family_communityties <chr> NA, NA, "4.18", "4.66", "3.84", "3.67", "N/A", "4…
## $ trust <chr> NA, NA, "3.96", "3.76", "3.14", "2.38", "3.77", "…
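The renaming step can be sketched in base R as follows. This is only an illustration: the original long column names are not shown in this chapter, so the names in `raw` below are hypothetical, and the authors' actual code (linked in section 3.5) may use different functions.

```r
# Hypothetical raw data with long, space-containing column names
# (assumed for illustration; the real original names are not shown here).
raw <- data.frame(
  "DBN" = c("01M015", "01M019"),
  "School Name" = c("P.S. 015 ROBERTO CLEMENTE", "P.S. 019 ASHER LEVY"),
  "Total Parent Response Rate" = c(91, 100),
  "Collaborative Teachers Score" = c("4.0999999999999996", "4.53"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Lookup table: new short name = old long name.
name_map <- c(
  name            = "School Name",
  parent_rr       = "Total Parent Response Rate",
  collab_teachers = "Collaborative Teachers Score"
)

renamed <- raw
idx <- match(name_map, names(renamed))  # positions of the old names
names(renamed)[idx] <- names(name_map)  # overwrite them with the short names
names(renamed)
```

Keeping the mapping in one named vector makes it easy to review every rename in a single place.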
3.2 Step 2: Transform survey responses from students, teachers, and parents
The original data frame uses the survey questions as column names. When the data are loaded into R, the answer choices are read in as the values of the first row. We therefore merged each answer choice into its column name, which keeps the column names unique and simplifies data wrangling. For the same reason, we replaced the full question text in each column name with a question index (for example, 1a).
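A minimal base-R sketch of this header merge is shown below. The layout of `raw` (question text repeated in the header, answer choices in the first data row) and the placeholder question text are assumptions made for illustration; the short codes follow the column names shown in this chapter.

```r
# Hypothetical raw sheet: the header holds the question text and the first
# data row holds the answer choices.
raw <- data.frame(
  V1 = c(NA, "01M034", "01M140"),
  V2 = c("Strongly disagree", "11", "14"),
  V3 = c("Disagree", "23", "47"),
  V4 = c("Agree", "45", "63"),
  stringsAsFactors = FALSE
)
names(raw) <- c("DBN", rep("1a. (question text)", 3))

# Short codes for the answer choices.
choice_codes <- c(
  "Strongly disagree" = "s_disagree",
  "Disagree"          = "disagree",
  "Agree"             = "agree"
)

question_cols <- seq(2, ncol(raw))

# Question index = everything before the first period, e.g. "1a".
q_index <- sub("\\..*$", "", names(raw)[question_cols])

# Answer choice = value stored in the first row of each question column.
first_row <- vapply(raw[question_cols], function(col) col[1],
                    character(1), USE.NAMES = FALSE)

# New column name = "<question index>-<choice code>".
names(raw)[question_cols] <- paste0(q_index, "-", unname(choice_codes[first_row]))

clean <- raw[-1, , drop = FALSE]  # drop the row that held the answer choices
rownames(clean) <- NULL
names(clean)
```

The count columns are still character at this point, so they need to be converted to numeric afterwards.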
In addition, many columns are read with the wrong data type: they are stored as character but should be numeric. We therefore convert those columns to numeric.
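The type conversion can be sketched as below. The sample values are taken from the output in section 3.1; treating the string "N/A" as a missing value is an assumption about how the authors handled it, and `to_numeric` is a hypothetical helper name.

```r
# Small sample with the character-typed columns seen in section 3.1.
scores <- data.frame(
  DBN             = c("01M015", "01M019", "01M034"),
  student_rr      = c("N/A", "N/A", "95"),
  collab_teachers = c("4.0999999999999996", "4.53", "2.71"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn the "N/A" marker into a real NA, then convert.
to_numeric <- function(x) {
  x[x %in% "N/A"] <- NA
  as.numeric(x)
}

num_cols <- setdiff(names(scores), "DBN")  # identifier column stays character
scores[num_cols] <- lapply(scores[num_cols], to_numeric)
str(scores)
```

Replacing "N/A" before calling `as.numeric()` avoids the coercion warning that R would otherwise raise for non-numeric strings.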
An example of the students’ responses after data wrangling:
## # A tibble: 6 × 272
## DBN `School Name` student_rr `1a-s_disagree` `1a-disagree` `1a-agree`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01M034 P.S. 034 FRANKLIN … 95 11 23 45
## 2 01M140 P.S. 140 NATHAN ST… 87 14 47 63
## 3 01M184 P.S. 184M SHUANG W… 100 24 59 125
## 4 01M188 P.S. 188 THE ISLAN… 66 0 0 7
## 5 01M292 ORCHARD COLLEGIATE… 82 5 24 75
## 6 01M332 UNIVERSITY NEIGHBO… 92 5 21 111
## # … with 266 more variables: 1a-s_agree <dbl>, 1a-idk <dbl>,
## # 1b-s_disagree <dbl>, 1b-disagree <dbl>, 1b-agree <dbl>, 1b-s_agree <dbl>,
## # 1b-idk <dbl>, 1c-s_disagree <dbl>, 1c-disagree <dbl>, 1c-agree <dbl>,
## # 1c-s_agree <dbl>, 1c-idk <dbl>, 1d-s_disagree <dbl>, 1d-disagree <dbl>,
## # 1d-agree <dbl>, 1d-s_agree <dbl>, 1d-idk <dbl>, 1e-s_disagree <dbl>,
## # 1e-disagree <dbl>, 1e-agree <dbl>, 1e-s_agree <dbl>, 1e-idk <dbl>,
## # 1f-s_disagree <dbl>, 1f-disagree <dbl>, 1f-agree <dbl>, 1f-s_agree <dbl>, …
An example of the parents’ responses after data wrangling:
## # A tibble: 6 × 196
## DBN `School Name` parent_rr `1a-s_disagree` `1a-disagree` `1a-agree`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01M015 P.S. 015 ROBERTO CL… 91 2 4 52
## 2 01M019 P.S. 019 ASHER LEVY 100 3 1 112
## 3 01M020 P.S. 020 ANNA SILVER 58 4 13 119
## 4 01M034 P.S. 034 FRANKLIN D… 29 1 5 27
## 5 01M063 THE STAR ACADEMY - … 80 3 8 63
## 6 01M064 P.S. 064 ROBERT SIM… 52 2 2 42
## # … with 190 more variables: 1a-s_agree <dbl>, 1b-s_disagree <dbl>,
## # 1b-disagree <dbl>, 1b-agree <dbl>, 1b-s_agree <dbl>, 1c-s_disagree <dbl>,
## # 1c-disagree <dbl>, 1c-agree <dbl>, 1c-s_agree <dbl>, 1d-s_disagree <dbl>,
## # 1d-disagree <dbl>, 1d-agree <dbl>, 1d-s_agree <dbl>, 1e-s_disagree <dbl>,
## # 1e-disagree <dbl>, 1e-agree <dbl>, 1e-s_agree <dbl>, 1f-s_disagree <dbl>,
## # 1f-disagree <dbl>, 1f-agree <dbl>, 1f-s_agree <dbl>, 1g-s_disagree <dbl>,
## # 1g-disagree <dbl>, 1g-agree <dbl>, 1g-s_agree <dbl>, 1h-s_disagree <dbl>, …
An example of the teachers’ responses after data wrangling:
## # A tibble: 6 × 579
## DBN `School Name` teacher_rr `1a-none` `1a-some` `1a-a_lot` `1a-all`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01M015 P.S. 015 ROBERTO CL… 100 0 3 14 11
## 2 01M019 P.S. 019 ASHER LEVY 93 0 9 17 10
## 3 01M020 P.S. 020 ANNA SILVER 90 1 13 19 4
## 4 01M034 P.S. 034 FRANKLIN D… 100 2 14 9 12
## 5 01M063 THE STAR ACADEMY - … 100 0 12 11 2
## 6 01M064 P.S. 064 ROBERT SIM… 96 0 1 8 15
## # … with 572 more variables: 1b-none <dbl>, 1b-some <dbl>, 1b-a_lot <dbl>,
## # 1b-all <dbl>, 1c-none <dbl>, 1c-some <dbl>, 1c-a_lot <dbl>, 1c-all <dbl>,
## # 1d-none <dbl>, 1d-some <dbl>, 1d-a_lot <dbl>, 1d-all <dbl>, 1e-none <dbl>,
## # 1e-some <dbl>, 1e-a_lot <dbl>, 1e-all <dbl>, 2a-s_disagree <dbl>,
## # 2a-disagree <dbl>, 2a-agree <dbl>, 2a-s_agree <dbl>, 2a-idk <dbl>,
## # 2b-s_disagree <dbl>, 2b-disagree <dbl>, 2b-agree <dbl>, 2b-s_agree <dbl>,
## # 2b-idk <dbl>, 2c-s_disagree <dbl>, 2c-disagree <dbl>, 2c-agree <dbl>, …
3.3 Step 3: Categorizing data based on survey questions
The survey contains a large number of questions, which makes the data difficult to manage. After reading through the questions, we grouped them into categories and chose several categories to analyze, based on our interests and on whether enough data were available to support the analysis.
We recognize that categorizing data by survey question is highly subjective, and we acknowledge this as a limitation. We welcome discussion and advice on how to categorize the survey questions.
Several questions were removed outright during categorization. We removed questions 3, 4b, 4h, 14, 15, 16, 24, and 25 from the teachers’ survey, either because they are not particularly relevant to the questions asked of students and parents or because they contain a significant number of NAs. Questions 5d, 5e, 7, 8, 9, and 10 were removed from the parents’ dataset for similar reasons, or because they target specific populations (e.g., pre-K or high school). Some further questions may be removed in later steps because of inconsistent answer choices within the same category. For example, questions a and b may fall under the same category, yet question a has five answer choices while question b has only four.
After discussion, the aspects we decided to examine include teaching quality (from the perspectives of students, parents, and teachers), student performance (based on teachers’ feedback), and communication between school and family (based mainly on responses from parents and teachers). Lastly, we look into characteristics that may contribute to teaching quality, for example, trust and respect between teachers.
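Dropping whole questions by their index can be sketched as follows. The column names follow the `<question index>-<answer choice>` pattern from Step 2, but the specific columns and counts in `teacher` are hypothetical, and only a subset of the removed teacher questions is used.

```r
# Hypothetical slice of the teachers' data after Step 2.
teacher <- data.frame(
  DBN        = c("01M015", "01M019"),
  `1a-none`  = c(0, 0),
  `3a-agree` = c(5, 7),
  `4a-agree` = c(3, 4),
  `4b-agree` = c(1, 2),
  `14-agree` = c(9, 8),
  check.names = FALSE
)

drop_questions <- c("3", "4b", "14")  # subset of the removed teacher questions

# Split e.g. "4b-agree" into its question index ("4b") and number ("4").
q_index  <- sub("-.*$", "", names(teacher))
q_number <- sub("[a-z]+$", "", q_index)

# Drop a column if its full index or its question number is on the list,
# so "3" removes every sub-question of question 3 while "4b" removes only 4b.
drop <- q_index %in% drop_questions | q_number %in% drop_questions
teacher_kept <- teacher[, !drop, drop = FALSE]
names(teacher_kept)
```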
3.4 Step 4: Datasets that we decide to analyze
Note that some answer choices were renamed because of inconsistencies between questions. In addition, some questions were deleted for the same reason.
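The sketch below illustrates one way to rename an answer choice and then build per-category columns. It assumes that each category column is the sum of the per-question counts for that answer choice; the chapter does not state this explicitly. It also assumes that questions 1a and 1b belong to the same category, and the 1b counts are hypothetical.

```r
# Teacher counts in the Step 2 layout (1a values from section 3.2,
# 1b values made up for illustration).
teacher <- data.frame(
  DBN        = c("01M015", "01M019"),
  `1a-some`  = c(3, 9),
  `1a-a_lot` = c(14, 17),
  `1b-some`  = c(4, 6),
  `1b-a_lot` = c(10, 12),
  check.names = FALSE
)

# Rename an answer choice so that it is spelled the same way everywhere.
names(teacher) <- sub("-a_lot$", "-lot", names(teacher))

category_questions <- c("1a", "1b")  # assumed members of one category
choices <- c("some", "lot")

# For each answer choice, add up the counts over all questions in the category.
totals <- data.frame(DBN = teacher$DBN)
for (choice in choices) {
  cols <- paste0(category_questions, "-", choice)
  totals[[paste0("t_", choice)]] <- unname(rowSums(teacher[cols], na.rm = TRUE))
}
totals
```

Summing only works when every question in the category offers the same answer choices, which is why inconsistent questions have to be renamed or dropped first.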
3.4.1 Communication with families from teachers’ and parents’ perspectives
## # A tibble: 6 × 13
## DBN `School Name` teacher_rr t_s_disagree t_s_agree t_agree t_disagree
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01M015 P.S. 015 ROBERTO … 100 1 85 235 2
## 2 01M019 P.S. 019 ASHER LE… 93 2 95 284 7
## 3 01M020 P.S. 020 ANNA SIL… 90 1 79 270 10
## 4 01M034 P.S. 034 FRANKLIN… 100 10 83 275 41
## 5 01M063 THE STAR ACADEMY … 100 0 65 190 6
## 6 01M064 P.S. 064 ROBERT S… 96 2 83 205 9
## # … with 6 more variables: parent_rr <dbl>, p_s_disagree <dbl>,
## # p_s_agree <dbl>, p_agree <dbl>, p_disagree <dbl>, p_idk <dbl>
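A table like the one above, with teacher and parent columns side by side, can be produced by joining the two per-school summaries on DBN. The sketch below uses base R's `merge()`; the choice of an inner join is an assumption, and the `p_agree` values are hypothetical.

```r
# Teacher summary (values from the table above) and a hypothetical parent summary.
teacher_comm <- data.frame(
  DBN        = c("01M015", "01M019", "01M020"),
  teacher_rr = c(100, 93, 90),
  t_agree    = c(235, 284, 270)
)
parent_comm <- data.frame(
  DBN       = c("01M015", "01M019"),
  parent_rr = c(91, 100),
  p_agree   = c(10, 20)  # made-up counts for illustration
)

# Inner join on the school identifier: only schools present in both
# summaries are kept.
communication <- merge(teacher_comm, parent_comm, by = "DBN")
communication
```

Prefixing the columns with `t_` and `p_` before the join keeps the two perspectives distinguishable in the combined table.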
3.4.2 Teaching quality and student performance
## # A tibble: 6 × 19
## DBN `School Name` parent_rr p_v_neagtive p_negative p_positive p_v_positive
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01M034 P.S. 034 FRA… 29 1 5 130 59
## 2 01M140 P.S. 140 NAT… 57 5 16 331 174
## 3 01M184 P.S. 184M SH… 92 5 50 928 385
## 4 01M188 P.S. 188 THE… 99 2 3 513 198
## 5 01M292 ORCHARD COLL… 49 6 10 163 71
## 6 01M332 UNIVERSITY N… 76 4 23 285 124
## # … with 12 more variables: p_idk <dbl>, teacher_rr <dbl>, t_NAs <dbl>,
## # t_v_negative <dbl>, t_v_positive <dbl>, t_positive <dbl>, t_negative <dbl>,
## # student_rr <dbl>, st_v_negative <dbl>, st_v_positive <dbl>,
## # st_positive <dbl>, st_negative <dbl>
## # A tibble: 6 × 8
## DBN `School Name` teacher_rr t_none t_some t_lot t_all t_idk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01M015 P.S. 015 ROBERTO CLEMENTE 100 5 87 150 66 0
## 2 01M019 P.S. 019 ASHER LEVY 93 2 112 212 89 0
## 3 01M020 P.S. 020 ANNA SILVER 90 18 150 222 26 2
## 4 01M034 P.S. 034 FRANKLIN D. ROOSEV… 100 17 189 95 101 4
## 5 01M063 THE STAR ACADEMY - P.S.63 100 2 105 142 26 0
## 6 01M064 P.S. 064 ROBERT SIMON 96 0 59 134 71 0
3.4.3 Trust, respect, and collaboration between teachers
## # A tibble: 6 × 15
## DBN `School Name` teacher_rr r_s_disagree r_s_agree r_agree r_disagree
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01M015 P.S. 015 ROBERTO … 100 2 111 337 16
## 2 01M019 P.S. 019 ASHER LE… 93 8 112 408 40
## 3 01M020 P.S. 020 ANNA SIL… 90 29 114 445 81
## 4 01M034 P.S. 034 FRANKLIN… 100 58 93 447 170
## 5 01M063 THE STAR ACADEMY … 100 6 84 289 22
## 6 01M064 P.S. 064 ROBERT S… 96 2 110 304 6
## # … with 8 more variables: c_s_disagree <dbl>, c_s_agree <dbl>, c_agree <dbl>,
## # c_disagree <dbl>, t_s_disagree <dbl>, t_s_agree <dbl>, t_agree <dbl>,
## # t_disagree <dbl>
3.4.4 Diversity and inclusion from students’ and teachers’ perspectives
## # A tibble: 6 × 14
## DBN `School Name` teacher_rr t_s_disagree t_s_agree t_agree t_disagree t_idk
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01M0… P.S. 034 FRA… 100 25 196 624 97 4
## 2 01M1… P.S. 140 NAT… 89 1 162 512 17 2
## 3 01M1… P.S. 184M SH… 100 7 173 782 93 9
## 4 01M1… P.S. 188 THE… 69 1 249 547 2 0
## 5 01M2… ORCHARD COLL… 100 6 113 352 39 29
## 6 01M3… UNIVERSITY N… 86 0 189 457 3 7
## # … with 6 more variables: student_rr <dbl>, st_s_disagree <dbl>,
## # st_s_agree <dbl>, st_agree <dbl>, st_disagree <dbl>, st_idk <dbl>
3.4.5 Bullying, harassment, and aggressive behaviors from teachers’ and students’ perspectives
## # A tibble: 6 × 12
## DBN `School Name` teacher_rr t_none t_rarely t_some t_most student_rr
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01M034 P.S. 034 FRANKLIN … 100 7 2 13 15 95
## 2 01M140 P.S. 140 NATHAN ST… 89 1 12 18 1 87
## 3 01M184 P.S. 184M SHUANG W… 100 1 10 40 5 100
## 4 01M188 P.S. 188 THE ISLAN… 69 11 13 2 1 66
## 5 01M292 ORCHARD COLLEGIATE… 100 3 9 11 1 82
## 6 01M332 UNIVERSITY NEIGHBO… 86 1 7 17 0 92
## # … with 4 more variables: st_none <dbl>, st_rarely <dbl>, st_some <dbl>,
## # st_most <dbl>
3.5 Link to code
Our data-cleaning code is available at:
https://github.com/hannawong/NYCSchoolSurvey/blob/main/03-cleaning.Rmd