Chapter 2 Data sources

2.2 A glimpse at our dataset

2.2.1 Dataframe on Aggregated Data

The survey data include an aggregated dataframe in which scores for different dimensions of school performance are calculated from student, teacher, and parent responses.

2.2.1.1 Meanings of columns

Before performing any exploratory data analysis or visualization, we need to understand what each column in our dataset means. Each of the three survey results (Student, Teacher, and Parent) contains an aggregated dataframe with 11 columns:

  • DBN: Primary key for the dataset and identification of schools.

  • School Name: Name of the school.

  • Total Parent/Teacher/Student Response Rate: Response rate, as a percentage, of parents, teachers, and students respectively (three separate columns).

  • Collaborative Teachers Score

  • Effective School Leadership Score

  • Rigorous Instruction Score

  • Supportive Environment Score

  • Strong Family-Community Ties Score

  • Trust Score

Although the aggregated dataframe should be identical across the three datasets, it differs from one dataset to the next. The structure of this dataframe in each dataset is displayed below. In summary, we found that the aggregated dataframes in the teacher and parent datasets contain DBNs whose evaluation scores are entirely missing. We therefore use the aggregated dataframe provided in the student response dataset.
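As a rough illustration of how such rows could be identified, the sketch below flags DBNs whose score columns are all missing. It runs on a small toy dataframe rather than the survey files; the helper name `all_scores_void` is hypothetical, and the assumption that missing scores appear either as `NA` or as the string "N/A" is taken from the printed structure of the dataframes.

```r
# Toy stand-in for one aggregated dataframe (values are illustrative only).
agg <- data.frame(
  DBN = c("01M015", "01M019", "01M020"),
  `Collaborative Teachers Score` = c("4.53", "N/A", NA),
  `Trust Score` = c("3.96", "N/A", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for rows in which every "... Score" column
# is either NA or the literal string "N/A".
all_scores_void <- function(df) {
  score_cols <- grep("Score$", names(df), value = TRUE)
  is_void <- is.na(df[score_cols]) | df[score_cols] == "N/A"
  rowSums(is_void) == length(score_cols)
}

void_rows <- all_scores_void(agg)

# DBNs without any usable evaluation score.
agg$DBN[void_rows]
```

Applying the same helper to each of the three aggregated dataframes and comparing `sum(void_rows)` would give one way to quantify how they differ.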

2.2.1.2 Student Response dataset

The student response result is a dataframe with 1,831 observations and 11 columns. Because the first two rows are completely empty, the dataset actually contains 1,829 observations. Its structure is shown below:

## tibble [1,831 × 11] (S3: tbl_df/tbl/data.frame)
##  $ DBN                               : chr [1:1831] NA NA "01M015" "01M019" ...
##  $ School Name                       : chr [1:1831] NA NA "P.S. 015 ROBERTO CLEMENTE" "P.S. 019 ASHER LEVY" ...
##  $ Total Parent Response Rate        : num [1:1831] NA NA 91 100 58 29 80 52 79 46 ...
##  $ Total Teacher Response Rate       : num [1:1831] NA NA 100 93 90 100 100 96 77 93 ...
##  $ Total Student Response Rate       : chr [1:1831] NA NA "N/A" "N/A" ...
##  $ Collaborative Teachers Score      : chr [1:1831] NA NA "4.0999999999999996" "4.53" ...
##  $ Effective School Leadership Score : chr [1:1831] NA NA "4.1900000000000004" "4.51" ...
##  $ Rigorous Instruction Score        : chr [1:1831] NA NA "4.0199999999999996" "4.8" ...
##  $ Supportive Environment Score      : chr [1:1831] NA NA "N/A" "N/A" ...
##  $ Strong Family-Community Ties Score: chr [1:1831] NA NA "4.18" "4.66" ...
##  $ Trust Score                       : chr [1:1831] NA NA "3.96" "3.76" ...
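The structure above suggests two cleaning steps before the scores can be used numerically: dropping the empty leading rows and converting the character columns, in which "N/A" marks a missing value, to numeric. The following is a minimal sketch on a toy dataframe that mirrors a few of the columns; the object name `student_agg` is an assumption, and this is not necessarily the cleaning procedure used for the analysis.

```r
# Toy stand-in mirroring the printed structure: two empty leading rows,
# scores stored as character, "N/A" marking missing values.
student_agg <- data.frame(
  DBN = c(NA, NA, "01M015", "01M019"),
  `Total Teacher Response Rate` = c(NA, NA, 100, 93),
  `Supportive Environment Score` = c(NA, NA, "N/A", "N/A"),
  `Trust Score` = c(NA, NA, "3.96", "3.76"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop rows in which every column is NA.
student_agg <- student_agg[rowSums(!is.na(student_agg)) > 0, ]
rownames(student_agg) <- NULL

# Convert every column except the identifiers to numeric,
# mapping the string "N/A" to NA first to avoid coercion warnings.
id_cols <- c("DBN", "School Name")
value_cols <- setdiff(names(student_agg), id_cols)
student_agg[value_cols] <- lapply(student_agg[value_cols], function(x) {
  x[x %in% "N/A"] <- NA
  as.numeric(x)
})

str(student_agg)
```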

2.2.1.3 Teacher Response dataset

Similarly, the teacher response result is a dataframe with 1,903 observations and 11 columns. Its structure is shown below:

## tibble [1,903 × 11] (S3: tbl_df/tbl/data.frame)
##  $ DBN                               : chr [1:1903] "01M015" "01M019" "01M020" "01M034" ...
##  $ School Name                       : chr [1:1903] "P.S. 015 ROBERTO CLEMENTE" "P.S. 019 ASHER LEVY" "P.S. 020 ANNA SILVER" "P.S. 034 FRANKLIN D. ROOSEVELT" ...
##  $ Total Parent Response Rate        : num [1:1903] 91 100 58 29 80 52 79 46 57 100 ...
##  $ Total Teacher Response Rate       : num [1:1903] 100 93 90 100 100 96 77 93 89 100 ...
##  $ Total Student Response Rate       : chr [1:1903] "N/A" "N/A" "N/A" "95" ...
##  $ Collaborative Teachers Score      : chr [1:1903] "4.0999999999999996" "4.53" "2.71" "2.69" ...
##  $ Effective School Leadership Score : chr [1:1903] "4.1900000000000004" "4.51" "2.98" "2.59" ...
##  $ Rigorous Instruction Score        : chr [1:1903] "4.0199999999999996" "4.8" "1.92" "2.14" ...
##  $ Supportive Environment Score      : chr [1:1903] "N/A" "N/A" "N/A" "N/A" ...
##  $ Strong Family-Community Ties Score: chr [1:1903] "4.18" "4.66" "3.84" "3.67" ...
##  $ Trust Score                       : chr [1:1903] "3.96" "3.76" "3.14" "2.38" ...

2.2.1.4 Parent Response dataset

The parent response result is a dataframe with 2,946 observations and 11 columns. Its structure is shown below:

## tibble [2,946 × 11] (S3: tbl_df/tbl/data.frame)
##  $ DBN                               : chr [1:2946] "01M015" "01M019" "01M020" "01M034" ...
##  $ School Name                       : chr [1:2946] "P.S. 015 ROBERTO CLEMENTE" "P.S. 019 ASHER LEVY" "P.S. 020 ANNA SILVER" "P.S. 034 FRANKLIN D. ROOSEVELT" ...
##  $ Total Parent Response Rate        : num [1:2946] 91 100 58 29 80 52 79 46 57 100 ...
##  $ Total Teacher Response Rate       : num [1:2946] 100 93 90 100 100 96 77 93 89 100 ...
##  $ Total Student Response Rate       : chr [1:2946] "N/A" "N/A" "N/A" "95" ...
##  $ Collaborative Teachers Score      : chr [1:2946] "4.0999999999999996" "4.53" "2.71" "2.69" ...
##  $ Effective School Leadership Score : chr [1:2946] "4.1900000000000004" "4.51" "2.98" "2.59" ...
##  $ Rigorous Instruction Score        : chr [1:2946] "4.0199999999999996" "4.8" "1.92" "2.14" ...
##  $ Supportive Environment Score      : chr [1:2946] "N/A" "N/A" "N/A" "N/A" ...
##  $ Strong Family-Community Ties Score: chr [1:2946] "4.18" "4.66" "3.84" "3.67" ...
##  $ Trust Score                       : chr [1:2946] "3.96" "3.76" "3.14" "2.38" ...

2.2.2 Data on Specific Question Responses

These dataframes list the survey questions (all multiple choice) and count the number of responses that fall into each answer category. In their raw form, each dataframe has the questions as column names, while the answer choices appear in the first row. Because each question can span five or six columns, each dataframe contains hundreds of variables. After data cleaning, we therefore chose to focus on specific categories within each dataframe.

2.2.2.1 Student Response dataset

The Student Response dataset is a dataframe with 1,111 observations and 272 columns. The actual number of observations is 1,109, since the first two rows do not contain survey response counts. The first few rows are shown below:

## # A tibble: 6 × 272
##   DBN    `School Name` `Total Student … `1a. This schoo… ...5  ...6  ...7  ...8 
##   <chr>  <chr>         <chr>            <chr>            <chr> <chr> <chr> <chr>
## 1 <NA>   <NA>          <NA>             Strongly disagr… Disa… Agree Stro… I do…
## 2 <NA>   <NA>          <NA>             <NA>             <NA>  <NA>  <NA>  <NA> 
## 3 01M034 P.S. 034 FRA… 95               11               23    45    18    17   
## 4 01M140 P.S. 140 NAT… 87               14               47    63    19    24   
## 5 01M184 P.S. 184M SH… 100              24               59    125   38    26   
## 6 01M188 P.S. 188 THE… 66               0                0     7     93    0    
## # … with 264 more variables:
## #   1b. The programs, classes, and activities at this school encourage students to develop talent outside academics. <chr>,
## #   ...10 <chr>, ...11 <chr>, ...12 <chr>, ...13 <chr>,
## #   1c. This school is kept clean. <chr>, ...15 <chr>, ...16 <chr>,
## #   ...17 <chr>, ...18 <chr>,
## #   1d. Most students at this school treat each other with respect. <chr>,
## #   ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, …
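One possible way to flatten this two-level header is sketched below: the question text is carried forward over the placeholder column names (such as `...5`), combined with the answer choice from the first row, and the two header rows are then dropped. The toy dataframe uses two question labels from the output above, but its counts are made up for illustration, and the " | " separator and the object names `raw` and `clean` are assumptions rather than the procedure actually used.

```r
# Toy stand-in for a raw question-level sheet: question text in the column
# names, answer choices in row 1, an empty row 2, counts from row 3 onward.
raw <- data.frame(
  c(NA, NA, "01M034", "01M140"),
  c("Strongly disagree", NA, "11", "14"),
  c("Disagree", NA, "23", "47"),
  c("Strongly disagree", NA, "5", "9"),
  c("Disagree", NA, "30", "52"),
  stringsAsFactors = FALSE
)
names(raw) <- c(
  "DBN",
  "1c. This school is kept clean.", "...3",
  "1d. Most students at this school treat each other with respect.", "...5"
)

# Carry the question text forward over placeholder names like "...3".
question <- names(raw)
question[grepl("^\\.\\.\\.[0-9]+$", question)] <- NA
for (i in seq_along(question)) {
  if (is.na(question[i]) && i > 1) question[i] <- question[i - 1]
}

# Combine question and answer choice; columns without an answer choice
# (identifiers such as DBN) keep their original name.
choice <- unlist(raw[1, ], use.names = FALSE)
new_names <- ifelse(is.na(choice), question, paste(question, choice, sep = " | "))

# Drop the two header rows and convert the count columns to integers.
clean <- raw[-(1:2), ]
names(clean) <- new_names
rownames(clean) <- NULL
count_cols <- !is.na(choice)
clean[count_cols] <- lapply(clean[count_cols], as.integer)

names(clean)
```

Because the Teacher and Parent dataframes follow the same layout, the same steps should apply to them with only the input changed.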

2.2.2.2 Teacher Response dataset

The Teacher Response dataset is a dataframe with 1,905 observations and 579 columns. The actual number of observations is 1,903, since the first two rows do not contain survey response counts. The first few rows are shown below:

## # A tibble: 6 × 579
##   DBN    `School Name`   `Total Teacher … `1a. How many teach… ...5  ...6  ...7 
##   <chr>  <chr>                      <dbl> <chr>                <chr> <chr> <chr>
## 1 <NA>   <NA>                          NA None                 Some  A lot All  
## 2 <NA>   <NA>                          NA <NA>                 <NA>  <NA>  <NA> 
## 3 01M015 P.S. 015 ROBER…              100 0                    3     14    11   
## 4 01M019 P.S. 019 ASHER…               93 0                    9     17    10   
## 5 01M020 P.S. 020 ANNA …               90 1                    13    19    4    
## 6 01M034 P.S. 034 FRANK…              100 2                    14    9     12   
## # … with 572 more variables:
## #   1b. How many teachers at this school are actively trying to improve their teaching? <chr>,
## #   ...9 <chr>, ...10 <chr>, ...11 <chr>,
## #   1c. How many teachers at this school take responsibility for improving the school? <chr>,
## #   ...13 <chr>, ...14 <chr>, ...15 <chr>,
## #   1d. How many teachers at this school are eager to try new ideas? <chr>,
## #   ...17 <chr>, ...18 <chr>, ...19 <chr>, …

2.2.2.3 Parent Response dataset

The Parent Response dataset is a dataframe with 2,948 observations and 196 columns. The actual number of observations is 2,946, since the first two rows do not contain survey response counts. The first few rows are shown below:

## # A tibble: 6 × 196
##   DBN    `School Name`  `Total Parent Re… `1a. School staff r… ...5  ...6  ...7 
##   <chr>  <chr>                      <dbl> <chr>                <chr> <chr> <chr>
## 1 <NA>   <NA>                          NA Strongly disagree    Disa… Agree Stro…
## 2 <NA>   <NA>                          NA <NA>                 <NA>  <NA>  <NA> 
## 3 01M015 P.S. 015 ROBE…                91 2                    4     52    65   
## 4 01M019 P.S. 019 ASHE…               100 3                    1     112   98   
## 5 01M020 P.S. 020 ANNA…                58 4                    13    119   82   
## 6 01M034 P.S. 034 FRAN…                29 1                    5     27    30   
## # … with 189 more variables:
## #   1b.  My child's school offers me opportunities to visit my child's classroom, such as observing instruction, participating in an activity with my child, etc. <chr>,
## #   ...9 <chr>, ...10 <chr>, ...11 <chr>,
## #   1c. My child’s school offers me the opportunity to volunteer time to support this school (for example, helping in classrooms, helping with school-wide events, etc.). <chr>,
## #   ...13 <chr>, ...14 <chr>, ...15 <chr>,
## #   1d. I am greeted warmly when I call or visit the school. <chr>,
## #   ...17 <chr>, ...18 <chr>, ...19 <chr>, …

2.3 Issues with our data

As described earlier, our data consist of two parts: the aggregated dataframe and the dataframes on specific survey responses. While the survey responses are categorical, the aggregated dataframe consists of numerical values calculated from these responses. We did not obtain specific information on how these values are calculated, so our analysis is based primarily on the dataframes of specific survey responses. To conduct the analysis, we cleaned and wrangled these dataframes based on our own understanding, which could be subjective and biased.