Chapter 2 Data sources
2.1 Data source links
The data is provided by the NYC Department of Education and can be downloaded from NYC Open Data.
According to the FAQ document of the NYC School Survey, the survey is voluntary, and all District 75 students are provided with survey materials. Responses from all participants (parents, students, and teachers) are confidential, but only the teachers’ responses are both anonymous and confidential. Parents and students may complete the survey online or on paper, whereas teachers may only take it online.
More information is available at: https://infohub.nyced.org/reports/school-quality/nyc-school-survey/survey-archives
The parent, teacher, and student response results of the NYC School Survey can be found at the links listed below:
https://data.cityofnewyork.us/Education/2019-NYC-School-Survey-Teachers/uzmq-2khh
https://data.cityofnewyork.us/Education/2019-NYC-School-Survey-Parents/4fcz-jias
https://data.cityofnewyork.us/Education/2019-NYC-School-Survey-Student/k2zg-756q
We downloaded each dataset in xlsx format from the links above.
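As a minimal sketch of how a downloaded workbook could be loaded, the helper below reads every sheet of one xlsx file into a named list. It assumes the readxl package is installed and that a workbook may hold more than one sheet; the function name `read_survey_workbook` and the file name in the usage comment are hypothetical and not taken from the original analysis.

```r
# Read all sheets of one survey workbook into a named list of tibbles.
# Assumes the readxl package is installed; the helper name is hypothetical.
read_survey_workbook <- function(path) {
  sheets <- readxl::excel_sheets(path)        # sheet names in the workbook
  tables <- lapply(sheets, function(s) {
    readxl::read_excel(path, sheet = s)       # one tibble per sheet
  })
  stats::setNames(tables, sheets)
}

# Hypothetical usage (the file name is an assumption):
# student <- read_survey_workbook("2019-student-survey.xlsx")
# names(student)
```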
2.2 A glimpse at our dataset
2.2.1 Dataframe on Aggregated Data
Each survey dataset contains an aggregated dataframe in which scores evaluating different dimensions of school performance are calculated from students’, teachers’, and parents’ responses.
2.2.1.1 Meanings of columns
Before we perform any exploratory data analysis and visualization, we need to understand what each column in our dataset means. The aggregated dataframes in the three datasets mentioned above (student, teacher, and parent survey results) each contain 11 columns:
DBN: Primary key for the dataset and identifier of each school.
School Name: Name of the school
Total Parent/Teacher/Student Response Rate: Response rate (in percent) of parents, teachers, and students, respectively (three columns)
Collaborative Teachers Score
Effective School Leadership Score
Rigorous Instruction Score
Supportive Environment Score
Strong Family-Community Ties Score
Trust Score
Although the aggregated dataframe should be identical across all three datasets, it differs in each one. The structure of this dataframe in each dataset is displayed below. In summary, we found that the aggregated dataframes in the teacher and parent datasets include DBNs whose evaluation scores are completely void. Thus, we decided to use the aggregated dataframe provided in the student response dataset.
2.2.1.2 Student Response dataset
The student response result is a dataframe with 1,831 observations and 11 columns. Because the first two rows of the dataframe are completely empty, the dataset actually consists of 1,829 observations. Its structure is shown below:
## tibble [1,831 × 11] (S3: tbl_df/tbl/data.frame)
## $ DBN : chr [1:1831] NA NA "01M015" "01M019" ...
## $ School Name : chr [1:1831] NA NA "P.S. 015 ROBERTO CLEMENTE" "P.S. 019 ASHER LEVY" ...
## $ Total Parent
## Response Rate : num [1:1831] NA NA 91 100 58 29 80 52 79 46 ...
## $ Total Teacher Response Rate : num [1:1831] NA NA 100 93 90 100 100 96 77 93 ...
## $ Total Student Response Rate : chr [1:1831] NA NA "N/A" "N/A" ...
## $ Collaborative Teachers Score : chr [1:1831] NA NA "4.0999999999999996" "4.53" ...
## $ Effective School Leadership Score : chr [1:1831] NA NA "4.1900000000000004" "4.51" ...
## $ Rigorous Instruction Score : chr [1:1831] NA NA "4.0199999999999996" "4.8" ...
## $ Supportive Environment Score : chr [1:1831] NA NA "N/A" "N/A" ...
## $ Strong Family-Community Ties Score: chr [1:1831] NA NA "4.18" "4.66" ...
## $ Trust Score : chr [1:1831] NA NA "3.96" "3.76" ...
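The output above shows two empty leading rows and score columns stored as text, with "N/A" marking missing values. A minimal base R sketch of one way to tidy this is given below. The toy dataframe `agg` reuses a few values from the output; rounding to two decimals is our assumption, not a documented step of the original analysis.

```r
# Toy stand-in for the aggregated dataframe (values copied from the output above).
agg <- data.frame(
  DBN = c(NA, NA, "01M015", "01M019"),
  `Collaborative Teachers Score` = c(NA, NA, "4.0999999999999996", "4.53"),
  `Supportive Environment Score` = c(NA, NA, "N/A", "N/A"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop rows in which every column is NA (the two empty leading rows).
all_empty <- rowSums(!is.na(agg)) == 0
agg_clean <- agg[!all_empty, , drop = FALSE]

# Convert score columns from text to numbers, treating "N/A" as missing.
score_cols <- grep("Score$", names(agg_clean), value = TRUE)
agg_clean[score_cols] <- lapply(agg_clean[score_cols], function(x) {
  x[x %in% "N/A"] <- NA        # mark the survey's "N/A" label as missing
  round(as.numeric(x), 2)      # assumption: two decimals are enough
})

str(agg_clean)
```

After this step the score columns are numeric, so they can be summarised or plotted directly.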
2.2.1.3 Teacher Response dataset
Similarly, the teacher response result is a dataframe of 1,903 observations and 11 columns. Its structure is shown below:
## tibble [1,903 × 11] (S3: tbl_df/tbl/data.frame)
## $ DBN : chr [1:1903] "01M015" "01M019" "01M020" "01M034" ...
## $ School Name : chr [1:1903] "P.S. 015 ROBERTO CLEMENTE" "P.S. 019 ASHER LEVY" "P.S. 020 ANNA SILVER" "P.S. 034 FRANKLIN D. ROOSEVELT" ...
## $ Total Parent
## Response Rate : num [1:1903] 91 100 58 29 80 52 79 46 57 100 ...
## $ Total Teacher Response Rate : num [1:1903] 100 93 90 100 100 96 77 93 89 100 ...
## $ Total Student Response Rate : chr [1:1903] "N/A" "N/A" "N/A" "95" ...
## $ Collaborative Teachers Score : chr [1:1903] "4.0999999999999996" "4.53" "2.71" "2.69" ...
## $ Effective School Leadership Score : chr [1:1903] "4.1900000000000004" "4.51" "2.98" "2.59" ...
## $ Rigorous Instruction Score : chr [1:1903] "4.0199999999999996" "4.8" "1.92" "2.14" ...
## $ Supportive Environment Score : chr [1:1903] "N/A" "N/A" "N/A" "N/A" ...
## $ Strong Family-Community Ties Score: chr [1:1903] "4.18" "4.66" "3.84" "3.67" ...
## $ Trust Score : chr [1:1903] "3.96" "3.76" "3.14" "2.38" ...
2.2.1.4 Parent Response dataset
The parent response result is a dataframe of 2,946 observations and 11 columns. Its structure is shown below:
## tibble [2,946 × 11] (S3: tbl_df/tbl/data.frame)
## $ DBN : chr [1:2946] "01M015" "01M019" "01M020" "01M034" ...
## $ School Name : chr [1:2946] "P.S. 015 ROBERTO CLEMENTE" "P.S. 019 ASHER LEVY" "P.S. 020 ANNA SILVER" "P.S. 034 FRANKLIN D. ROOSEVELT" ...
## $ Total Parent
## Response Rate : num [1:2946] 91 100 58 29 80 52 79 46 57 100 ...
## $ Total Teacher Response Rate : num [1:2946] 100 93 90 100 100 96 77 93 89 100 ...
## $ Total Student Response Rate : chr [1:2946] "N/A" "N/A" "N/A" "95" ...
## $ Collaborative Teachers Score : chr [1:2946] "4.0999999999999996" "4.53" "2.71" "2.69" ...
## $ Effective School Leadership Score : chr [1:2946] "4.1900000000000004" "4.51" "2.98" "2.59" ...
## $ Rigorous Instruction Score : chr [1:2946] "4.0199999999999996" "4.8" "1.92" "2.14" ...
## $ Supportive Environment Score : chr [1:2946] "N/A" "N/A" "N/A" "N/A" ...
## $ Strong Family-Community Ties Score: chr [1:2946] "4.18" "4.66" "3.84" "3.67" ...
## $ Trust Score : chr [1:2946] "3.96" "3.76" "3.14" "2.38" ...
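Section 2.2.1.1 notes that the teacher and parent aggregated dataframes include DBNs whose scores are completely void. One way such rows could be flagged is sketched below in base R. The first two rows of the toy dataframe reuse values from the output above; the third row, including its DBN "00X000", is invented purely for illustration.

```r
# Toy aggregated dataframe. Rows 1-2 reuse values shown above;
# row 3 (DBN "00X000") is invented to illustrate a school with void scores.
agg <- data.frame(
  DBN = c("01M015", "01M019", "00X000"),
  `Collaborative Teachers Score` = c("4.0999999999999996", "4.53", NA),
  `Trust Score` = c("3.96", "3.76", "N/A"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

score_cols <- grep("Score$", names(agg), value = TRUE)

# A score counts as void when it is NA or the literal text "N/A".
is_void <- is.na(agg[score_cols]) | agg[score_cols] == "N/A"

# DBNs for which every score column is void.
void_dbn <- agg$DBN[rowSums(is_void) == length(score_cols)]
void_dbn
```

Comparing the length of `void_dbn` across the three datasets would show how many schools carry no usable scores in each.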
2.2.2 Data on Specific Question Responses
These dataframes show the survey questions (all of which are multiple choice) and count the number of responses that fall into each answer category. Currently, each dataframe has the questions as column names, while the answer choices sit in the first row. Additionally, as each question may take up five or six columns, there are hundreds of variables within each dataframe. Thus, after data cleaning, we chose to focus on specific categories within each dataframe.
2.2.2.1 Student Response dataset
The student response dataset is a dataframe of 1,111 observations and 272 columns. The actual number of observations is 1,109, as the first two rows do not contain statistics on survey responses. The first few rows of the dataset are shown below:
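The layout described above (question text in the header, answer choices in the first row, placeholder names such as `...5` for the remaining answer columns) can be flattened into one descriptive name per column. The base R sketch below shows one possible approach on a toy dataframe; the counts are taken from the teacher output shown later in this chapter, while the label "1a. (question text)" is a placeholder for the full question and the " | " separator is our own choice.

```r
# Toy stand-in for a question-level sheet: question text in the header,
# answer choices in row 1, an empty row 2, response counts from row 3 on.
raw <- data.frame(
  DBN = c(NA, NA, "01M015", "01M019"),
  `1a. (question text)` = c("None", NA, "0", "0"),
  `...3` = c("Some", NA, "3", "9"),
  `...4` = c("A lot", NA, "14", "17"),
  `...5` = c("All", NA, "11", "10"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Blank out the placeholder names, then carry each question label forward.
question <- names(raw)
question[grepl("^\\.\\.\\.[0-9]+$", question)] <- NA
question <- question[!is.na(question)][cumsum(!is.na(question))]

# Answer choices live in the first row; NA marks non-question columns.
choice <- unlist(raw[1, ], use.names = FALSE)
new_names <- ifelse(is.na(choice), question,
                    paste(question, choice, sep = " | "))

# Drop the two header rows, apply the names, and make counts integers.
clean <- raw[-(1:2), , drop = FALSE]
names(clean) <- new_names
clean[!is.na(choice)] <- lapply(clean[!is.na(choice)], as.integer)
rownames(clean) <- NULL

names(clean)
```

With one self-describing name per column, the count columns belonging to a single question can be selected by matching on the question prefix.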
## # A tibble: 6 × 272
## DBN `School Name` `Total Student … `1a. This schoo… ...5 ...6 ...7 ...8
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> Strongly disagr… Disa… Agree Stro… I do…
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 01M034 P.S. 034 FRA… 95 11 23 45 18 17
## 4 01M140 P.S. 140 NAT… 87 14 47 63 19 24
## 5 01M184 P.S. 184M SH… 100 24 59 125 38 26
## 6 01M188 P.S. 188 THE… 66 0 0 7 93 0
## # … with 264 more variables:
## # 1b. The programs, classes, and activities at this school encourage students to develop talent outside academics. <chr>,
## # ...10 <chr>, ...11 <chr>, ...12 <chr>, ...13 <chr>,
## # 1c. This school is kept clean. <chr>, ...15 <chr>, ...16 <chr>,
## # ...17 <chr>, ...18 <chr>,
## # 1d. Most students at this school treat each other with respect. <chr>,
## # ...20 <chr>, ...21 <chr>, ...22 <chr>, ...23 <chr>, …
2.2.2.2 Teacher Response dataset
The teacher response dataset is a dataframe of 1,905 observations and 579 columns. The actual number of observations is 1,903, as the first two rows do not contain statistics on survey responses. The first few rows of the dataset are shown below:
## # A tibble: 6 × 579
## DBN `School Name` `Total Teacher … `1a. How many teach… ...5 ...6 ...7
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> NA None Some A lot All
## 2 <NA> <NA> NA <NA> <NA> <NA> <NA>
## 3 01M015 P.S. 015 ROBER… 100 0 3 14 11
## 4 01M019 P.S. 019 ASHER… 93 0 9 17 10
## 5 01M020 P.S. 020 ANNA … 90 1 13 19 4
## 6 01M034 P.S. 034 FRANK… 100 2 14 9 12
## # … with 572 more variables:
## # 1b. How many teachers at this school are actively trying to improve their teaching? <chr>,
## # ...9 <chr>, ...10 <chr>, ...11 <chr>,
## # 1c. How many teachers at this school take responsibility for improving the school? <chr>,
## # ...13 <chr>, ...14 <chr>, ...15 <chr>,
## # 1d. How many teachers at this school are eager to try new ideas? <chr>,
## # ...17 <chr>, ...18 <chr>, ...19 <chr>, …
2.2.2.3 Parent Response dataset
The parent response dataset is a dataframe of 2,948 observations and 196 columns. The actual number of observations is 2,946, as the first two rows do not contain statistics on survey responses. The first few rows of the dataset are shown below:
## # A tibble: 6 × 196
## DBN `School Name` `Total Parent Re… `1a. School staff r… ...5 ...6 ...7
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> NA Strongly disagree Disa… Agree Stro…
## 2 <NA> <NA> NA <NA> <NA> <NA> <NA>
## 3 01M015 P.S. 015 ROBE… 91 2 4 52 65
## 4 01M019 P.S. 019 ASHE… 100 3 1 112 98
## 5 01M020 P.S. 020 ANNA… 58 4 13 119 82
## 6 01M034 P.S. 034 FRAN… 29 1 5 27 30
## # … with 189 more variables:
## # 1b. My child's school offers me opportunities to visit my child's classroom, such as observing instruction, participating in an activity with my child, etc. <chr>,
## # ...9 <chr>, ...10 <chr>, ...11 <chr>,
## # 1c. My child’s school offers me the opportunity to volunteer time to support this school (for example, helping in classrooms, helping with school-wide events, etc.). <chr>,
## # ...13 <chr>, ...14 <chr>, ...15 <chr>,
## # 1d. I am greeted warmly when I call or visit the school. <chr>,
## # ...17 <chr>, ...18 <chr>, ...19 <chr>, …
2.3 Issues with our data
As described earlier, our data consists primarily of two parts: the aggregated dataframe and the dataframes of specific survey responses. While the survey responses are categorical, the aggregated dataframe consists of numerical values calculated from these survey responses. We did not obtain specific information on how these numerical values are calculated, so our analysis is based primarily on the dataframes of specific survey responses. To conduct the analysis, we cleaned and wrangled these dataframes based on our own understanding, which could be subjective and biased.