Chapter 3 Data transformation

3.1 Step 1: Change column names

The column names in the original dataset are long and contain spaces, which makes them inconvenient to work with in later steps.
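The renaming step can be sketched as follows. The analysis itself was done in R, so this is only a plain-Python illustration of the idea, not the code that was actually run. The long column names in `RENAME_MAP` are hypothetical placeholders; only the short names (`name`, `parent_rr`, and so on) come from the output in this section.

```python
# Hypothetical long names on the left; the real headers in the source file may differ.
RENAME_MAP = {
    "School Name": "name",
    "Total Parent Response Rate": "parent_rr",
    "Total Teacher Response Rate": "teacher_rr",
    "Total Student Response Rate": "student_rr",
}


def rename_columns(rows, rename_map):
    """Return new row dicts with keys renamed; keys not in the map are kept as-is."""
    return [{rename_map.get(key, key): value for key, value in row.items()}
            for row in rows]


rows = [{
    "DBN": "01M015",
    "School Name": "P.S. 015 ROBERTO CLEMENTE",
    "Total Parent Response Rate": "91",
}]
renamed = rename_columns(rows, RENAME_MAP)
print(list(renamed[0]))
```

Keeping the mapping in one dictionary makes it easy to review every rename in a single place.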

The columns of the aggregated data frame after renaming are:

## Rows: 1,831
## Columns: 11
## $ DBN                  <chr> NA, NA, "01M015", "01M019", "01M020", "01M034", "…
## $ name                 <chr> NA, NA, "P.S. 015 ROBERTO CLEMENTE", "P.S. 019 AS…
## $ parent_rr            <dbl> NA, NA, 91, 100, 58, 29, 80, 52, 79, 46, 57, 100,…
## $ teacher_rr           <dbl> NA, NA, 100, 93, 90, 100, 100, 96, 77, 93, 89, 10…
## $ student_rr           <chr> NA, NA, "N/A", "N/A", "N/A", "95", "N/A", "N/A", …
## $ collab_teachers      <chr> NA, NA, "4.0999999999999996", "4.53", "2.71", "2.…
## $ effective_schlead    <chr> NA, NA, "4.1900000000000004", "4.51", "2.98", "2.…
## $ rig_instruction      <chr> NA, NA, "4.0199999999999996", "4.8", "1.92", "2.1…
## $ supp_env             <chr> NA, NA, "N/A", "N/A", "N/A", "N/A", "N/A", "N/A",…
## $ family_communityties <chr> NA, NA, "4.18", "4.66", "3.84", "3.67", "N/A", "4…
## $ trust                <chr> NA, NA, "3.96", "3.76", "3.14", "2.38", "3.77", "…

3.2 Step 2: Transform survey responses from students, teachers, and parents

In the original data frame, the survey questions serve as the column names. When the data is loaded into R, the answer choices are read in as the values of the first row. We therefore combined each answer choice with its column name, which keeps the column names unique and simplifies data wrangling. For the same reason, we replaced the full question text in each column name with a question index (for example, 1a-s_disagree).
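A minimal sketch of how such column names could be built is shown below, again in plain Python rather than the R used for the analysis. It assumes that the question text appears only in the first column of each question's group, that the remaining columns of the group have a blank header, and that identifier columns come first. The abbreviation table and the placeholder question text are assumptions for illustration.

```python
import re

# Assumed abbreviations for the answer choices; extend as needed.
CHOICE_ABBREV = {
    "Strongly disagree": "s_disagree",
    "Disagree": "disagree",
    "Agree": "agree",
    "Strongly agree": "s_agree",
    "I don't know": "idk",
}


def build_column_names(header, first_row):
    """Combine the question index from the header with the answer choice in row one."""
    names = []
    current_index = None
    for question, choice in zip(header, first_row):
        # A header such as "1a. ..." starts a new question group.
        match = re.match(r"\s*(\d+[a-z]?)\.", question or "")
        if match:
            current_index = match.group(1)
        if current_index is None or not choice:
            # Identifier columns such as DBN keep their original header.
            names.append(question)
        else:
            names.append(f"{current_index}-{CHOICE_ABBREV.get(choice, choice)}")
    return names


header = ["DBN", "School Name", "1a. (question text)", "", "", "1b. (question text)", ""]
first_row = ["", "", "Strongly disagree", "Disagree", "Agree", "Strongly disagree", "Disagree"]
column_names = build_column_names(header, first_row)
print(column_names)
```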

In addition, many columns were read in with the wrong data type: they are stored as character but should be numeric. We therefore converted those columns to numeric.
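The type conversion can be illustrated with the short plain-Python sketch below (the analysis itself used R). It assumes that the string "N/A", as seen in the raw columns above, marks a missing value; the helper names are made up for this example.

```python
# Strings treated as missing; "N/A" appears in the raw data, the others are assumptions.
MISSING = {"", "N/A", "NA"}


def to_number(value):
    """Convert a raw string to float, or to None when it marks a missing value."""
    if value is None:
        return None
    text = str(value).strip()
    if text in MISSING:
        return None
    return float(text)


def convert_columns(rows, columns):
    """Return new rows in which only the listed columns are converted to numbers."""
    return [{key: (to_number(value) if key in columns else value)
             for key, value in row.items()}
            for row in rows]


raw = [{"DBN": "01M015", "student_rr": "N/A", "collab_teachers": "4.0999999999999996"}]
converted = convert_columns(raw, {"student_rr", "collab_teachers"})
print(converted[0])
```

Values such as "4.0999999999999996" are the full decimal expansion of a stored floating-point number, so they parse back to 4.1 without extra rounding.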

An example of the students’ responses after data wrangling:

## # A tibble: 6 × 272
##   DBN    `School Name`       student_rr `1a-s_disagree` `1a-disagree` `1a-agree`
##   <chr>  <chr>                    <dbl>           <dbl>         <dbl>      <dbl>
## 1 01M034 P.S. 034 FRANKLIN …         95              11            23         45
## 2 01M140 P.S. 140 NATHAN ST…         87              14            47         63
## 3 01M184 P.S. 184M SHUANG W…        100              24            59        125
## 4 01M188 P.S. 188 THE ISLAN…         66               0             0          7
## 5 01M292 ORCHARD COLLEGIATE…         82               5            24         75
## 6 01M332 UNIVERSITY NEIGHBO…         92               5            21        111
## # … with 266 more variables: 1a-s_agree <dbl>, 1a-idk <dbl>,
## #   1b-s_disagree <dbl>, 1b-disagree <dbl>, 1b-agree <dbl>, 1b-s_agree <dbl>,
## #   1b-idk <dbl>, 1c-s_disagree <dbl>, 1c-disagree <dbl>, 1c-agree <dbl>,
## #   1c-s_agree <dbl>, 1c-idk <dbl>, 1d-s_disagree <dbl>, 1d-disagree <dbl>,
## #   1d-agree <dbl>, 1d-s_agree <dbl>, 1d-idk <dbl>, 1e-s_disagree <dbl>,
## #   1e-disagree <dbl>, 1e-agree <dbl>, 1e-s_agree <dbl>, 1e-idk <dbl>,
## #   1f-s_disagree <dbl>, 1f-disagree <dbl>, 1f-agree <dbl>, 1f-s_agree <dbl>, …

An example of the parents’ responses after data wrangling:

## # A tibble: 6 × 196
##   DBN    `School Name`        parent_rr `1a-s_disagree` `1a-disagree` `1a-agree`
##   <chr>  <chr>                    <dbl>           <dbl>         <dbl>      <dbl>
## 1 01M015 P.S. 015 ROBERTO CL…        91               2             4         52
## 2 01M019 P.S. 019 ASHER LEVY        100               3             1        112
## 3 01M020 P.S. 020 ANNA SILVER        58               4            13        119
## 4 01M034 P.S. 034 FRANKLIN D…        29               1             5         27
## 5 01M063 THE STAR ACADEMY - …        80               3             8         63
## 6 01M064 P.S. 064 ROBERT SIM…        52               2             2         42
## # … with 190 more variables: 1a-s_agree <dbl>, 1b-s_disagree <dbl>,
## #   1b-disagree <dbl>, 1b-agree <dbl>, 1b-s_agree <dbl>, 1c-s_disagree <dbl>,
## #   1c-disagree <dbl>, 1c-agree <dbl>, 1c-s_agree <dbl>, 1d-s_disagree <dbl>,
## #   1d-disagree <dbl>, 1d-agree <dbl>, 1d-s_agree <dbl>, 1e-s_disagree <dbl>,
## #   1e-disagree <dbl>, 1e-agree <dbl>, 1e-s_agree <dbl>, 1f-s_disagree <dbl>,
## #   1f-disagree <dbl>, 1f-agree <dbl>, 1f-s_agree <dbl>, 1g-s_disagree <dbl>,
## #   1g-disagree <dbl>, 1g-agree <dbl>, 1g-s_agree <dbl>, 1h-s_disagree <dbl>, …

An example of the teachers’ responses after data wrangling:

## # A tibble: 6 × 579
##   DBN    `School Name`        teacher_rr `1a-none` `1a-some` `1a-a_lot` `1a-all`
##   <chr>  <chr>                     <dbl>     <dbl>     <dbl>      <dbl>    <dbl>
## 1 01M015 P.S. 015 ROBERTO CL…        100         0         3         14       11
## 2 01M019 P.S. 019 ASHER LEVY          93         0         9         17       10
## 3 01M020 P.S. 020 ANNA SILVER         90         1        13         19        4
## 4 01M034 P.S. 034 FRANKLIN D…        100         2        14          9       12
## 5 01M063 THE STAR ACADEMY - …        100         0        12         11        2
## 6 01M064 P.S. 064 ROBERT SIM…         96         0         1          8       15
## # … with 572 more variables: 1b-none <dbl>, 1b-some <dbl>, 1b-a_lot <dbl>,
## #   1b-all <dbl>, 1c-none <dbl>, 1c-some <dbl>, 1c-a_lot <dbl>, 1c-all <dbl>,
## #   1d-none <dbl>, 1d-some <dbl>, 1d-a_lot <dbl>, 1d-all <dbl>, 1e-none <dbl>,
## #   1e-some <dbl>, 1e-a_lot <dbl>, 1e-all <dbl>, 2a-s_disagree <dbl>,
## #   2a-disagree <dbl>, 2a-agree <dbl>, 2a-s_agree <dbl>, 2a-idk <dbl>,
## #   2b-s_disagree <dbl>, 2b-disagree <dbl>, 2b-agree <dbl>, 2b-s_agree <dbl>,
## #   2b-idk <dbl>, 2c-s_disagree <dbl>, 2c-disagree <dbl>, 2c-agree <dbl>, …

3.3 Step 3: Categorizing data based on survey questions

The survey covers a large number of questions, which makes the data difficult to manage. After reading through the questions, we decided to group them into categories and to analyze a subset of those categories, chosen based on our interests and on whether enough data is available to support the analysis.

We recognize that categorizing data based on survey questions can be highly subjective, and we acknowledge this as a limitation. We welcome discussion and advice on how the questions are categorized.

Several questions were removed outright during categorization. We removed questions 3, 4b, 4h, 14, 15, 16, 24, and 25 from the teachers’ survey, either because they are not particularly relevant to the questions asked from the students’ and parents’ perspectives or because they contain a significant number of NAs. Questions 5d, 5e, 7, 8, 9, and 10 were removed from the parents’ dataset for similar reasons or because they target specific populations (e.g., pre-K, high school). Some questions may be removed in later steps because of inconsistent answer choices within the same category. For example, two questions may fall under the same category while one has five answer choices and the other has only four.
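The removal and consistency check described above can be sketched as follows, in plain Python rather than the R used for the analysis. It assumes column names of the form index-choice (for example, 4b-agree) as produced in Step 2, and it treats a bare number such as 3 as covering every sub-question of question 3. All function names are hypothetical.

```python
import re
from collections import defaultdict


def question_of(column):
    """'4b-agree' -> '4b'; returns None for non-question columns such as 'DBN'."""
    match = re.match(r"(\d+[a-z]?)-", column)
    return match.group(1) if match else None


def is_removed(question, removed):
    # '3' removes every sub-question (3a, 3b, ...); '4b' removes only 4b.
    number = re.match(r"\d+", question).group(0)
    return question in removed or number in removed


def drop_questions(columns, removed):
    """Keep identifier columns and every question not listed in `removed`."""
    kept = []
    for column in columns:
        question = question_of(column)
        if question is None or not is_removed(question, removed):
            kept.append(column)
    return kept


def choices_by_question(columns):
    """Map each question index to the set of answer choices it offers."""
    choices = defaultdict(set)
    for column in columns:
        question = question_of(column)
        if question is not None:
            choices[question].add(column.split("-", 1)[1])
    return dict(choices)


columns = ["DBN", "1a-none", "3a-agree", "4a-agree", "4b-agree", "14-agree"]
kept = drop_questions(columns, {"3", "4b", "14"})
print(kept)

# Questions in one category are consistent only if their choice sets are identical.
category = choices_by_question(["1a-agree", "1a-idk", "1b-agree"])
consistent = len({frozenset(c) for c in category.values()}) == 1
print(category, consistent)
```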

After discussion, we decided to examine the following aspects, among others: teaching quality (from the perspectives of students, parents, and teachers), student performance (based on teachers’ feedback), communication between school and family (mainly based on responses from parents and teachers), and characteristics that may contribute to teaching quality, such as trust and respect between teachers.

3.4 Step 4: Datasets that we decided to analyze

Note that some answer choices were renamed because they were inconsistent between questions. Some questions were deleted for the same reason.
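One way to arrive at per-category columns such as t_agree or p_disagree is to sum the counts for each answer choice across all questions in a category and prefix the result with a respondent code. The plain-Python sketch below illustrates that idea under this assumption; it is not the R code used for the analysis, the function name is hypothetical, and the counts in the example are made up.

```python
def aggregate_category(row, questions, prefix, rename=None):
    """Sum counts per answer choice over the given questions for one school."""
    rename = rename or {}
    totals = {}
    for column, count in row.items():
        if "-" not in column:
            continue  # identifier columns such as DBN
        question, choice = column.split("-", 1)
        if question not in questions or count is None:
            continue
        choice = rename.get(choice, choice)  # harmonize inconsistent labels
        key = f"{prefix}_{choice}"
        totals[key] = totals.get(key, 0) + count
    return totals


# Made-up counts for a single school.
parent_row = {"DBN": "01M015", "1a-agree": 52, "1a-disagree": 4,
              "1b-agree": 40, "1b-disagree": 1, "2a-agree": 9}
parent_totals = aggregate_category(parent_row, {"1a", "1b"}, "p")
print(parent_totals)

# Renaming an answer choice while aggregating ("a_lot" becomes "lot").
teacher_row = {"DBN": "01M015", "1a-a_lot": 14, "1b-a_lot": 17}
teacher_totals = aggregate_category(teacher_row, {"1a", "1b"}, "t", {"a_lot": "lot"})
print(teacher_totals)
```

The per-respondent totals for a school can then be combined into one record keyed by DBN.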

3.4.1 Communication with families from teachers’ and parents’ perspectives

## # A tibble: 6 × 13
##   DBN    `School Name`      teacher_rr t_s_disagree t_s_agree t_agree t_disagree
##   <chr>  <chr>                   <dbl>        <dbl>     <dbl>   <dbl>      <dbl>
## 1 01M015 P.S. 015 ROBERTO …        100            1        85     235          2
## 2 01M019 P.S. 019 ASHER LE…         93            2        95     284          7
## 3 01M020 P.S. 020 ANNA SIL…         90            1        79     270         10
## 4 01M034 P.S. 034 FRANKLIN…        100           10        83     275         41
## 5 01M063 THE STAR ACADEMY …        100            0        65     190          6
## 6 01M064 P.S. 064 ROBERT S…         96            2        83     205          9
## # … with 6 more variables: parent_rr <dbl>, p_s_disagree <dbl>,
## #   p_s_agree <dbl>, p_agree <dbl>, p_disagree <dbl>, p_idk <dbl>

3.4.2 Teaching quality and student performance

## # A tibble: 6 × 19
##   DBN    `School Name` parent_rr p_v_neagtive p_negative p_positive p_v_positive
##   <chr>  <chr>             <dbl>        <dbl>      <dbl>      <dbl>        <dbl>
## 1 01M034 P.S. 034 FRA…        29            1          5        130           59
## 2 01M140 P.S. 140 NAT…        57            5         16        331          174
## 3 01M184 P.S. 184M SH…        92            5         50        928          385
## 4 01M188 P.S. 188 THE…        99            2          3        513          198
## 5 01M292 ORCHARD COLL…        49            6         10        163           71
## 6 01M332 UNIVERSITY N…        76            4         23        285          124
## # … with 12 more variables: p_idk <dbl>, teacher_rr <dbl>, t_NAs <dbl>,
## #   t_v_negative <dbl>, t_v_positive <dbl>, t_positive <dbl>, t_negative <dbl>,
## #   student_rr <dbl>, st_v_negative <dbl>, st_v_positive <dbl>,
## #   st_positive <dbl>, st_negative <dbl>
## # A tibble: 6 × 8
##   DBN    `School Name`                teacher_rr t_none t_some t_lot t_all t_idk
##   <chr>  <chr>                             <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 01M015 P.S. 015 ROBERTO CLEMENTE           100      5     87   150    66     0
## 2 01M019 P.S. 019 ASHER LEVY                  93      2    112   212    89     0
## 3 01M020 P.S. 020 ANNA SILVER                 90     18    150   222    26     2
## 4 01M034 P.S. 034 FRANKLIN D. ROOSEV…        100     17    189    95   101     4
## 5 01M063 THE STAR ACADEMY - P.S.63           100      2    105   142    26     0
## 6 01M064 P.S. 064 ROBERT SIMON                96      0     59   134    71     0

3.4.3 Trust, respect, and collaboration between teachers

## # A tibble: 6 × 15
##   DBN    `School Name`      teacher_rr r_s_disagree r_s_agree r_agree r_disagree
##   <chr>  <chr>                   <dbl>        <dbl>     <dbl>   <dbl>      <dbl>
## 1 01M015 P.S. 015 ROBERTO …        100            2       111     337         16
## 2 01M019 P.S. 019 ASHER LE…         93            8       112     408         40
## 3 01M020 P.S. 020 ANNA SIL…         90           29       114     445         81
## 4 01M034 P.S. 034 FRANKLIN…        100           58        93     447        170
## 5 01M063 THE STAR ACADEMY …        100            6        84     289         22
## 6 01M064 P.S. 064 ROBERT S…         96            2       110     304          6
## # … with 8 more variables: c_s_disagree <dbl>, c_s_agree <dbl>, c_agree <dbl>,
## #   c_disagree <dbl>, t_s_disagree <dbl>, t_s_agree <dbl>, t_agree <dbl>,
## #   t_disagree <dbl>

3.4.4 Diversity and inclusion from students’ and teachers’ perspectives

## # A tibble: 6 × 14
##   DBN   `School Name` teacher_rr t_s_disagree t_s_agree t_agree t_disagree t_idk
##   <chr> <chr>              <dbl>        <dbl>     <dbl>   <dbl>      <dbl> <dbl>
## 1 01M0… P.S. 034 FRA…        100           25       196     624         97     4
## 2 01M1… P.S. 140 NAT…         89            1       162     512         17     2
## 3 01M1… P.S. 184M SH…        100            7       173     782         93     9
## 4 01M1… P.S. 188 THE…         69            1       249     547          2     0
## 5 01M2… ORCHARD COLL…        100            6       113     352         39    29
## 6 01M3… UNIVERSITY N…         86            0       189     457          3     7
## # … with 6 more variables: student_rr <dbl>, st_s_disagree <dbl>,
## #   st_s_agree <dbl>, st_agree <dbl>, st_disagree <dbl>, st_idk <dbl>

3.4.5 Bullying, harassment, and aggressive behavior from teachers’ and students’ perspectives

## # A tibble: 6 × 12
##   DBN    `School Name`       teacher_rr t_none t_rarely t_some t_most student_rr
##   <chr>  <chr>                    <dbl>  <dbl>    <dbl>  <dbl>  <dbl>      <dbl>
## 1 01M034 P.S. 034 FRANKLIN …        100      7        2     13     15         95
## 2 01M140 P.S. 140 NATHAN ST…         89      1       12     18      1         87
## 3 01M184 P.S. 184M SHUANG W…        100      1       10     40      5        100
## 4 01M188 P.S. 188 THE ISLAN…         69     11       13      2      1         66
## 5 01M292 ORCHARD COLLEGIATE…        100      3        9     11      1         82
## 6 01M332 UNIVERSITY NEIGHBO…         86      1        7     17      0         92
## # … with 4 more variables: st_none <dbl>, st_rarely <dbl>, st_some <dbl>,
## #   st_most <dbl>