
Load Libraries

library(tidyverse)
library(dplyr)
library(lubridate)
library(ggrepel)
covid <- read_csv(file =
"https://data.virginia.gov/api/views/bre9-aqqr/rows.csv?accessType=DOWNLOAD")
  1. Extracting the COVID data for four cities (Norfolk, Virginia Beach, Chesapeake, and Portsmouth) into a dataset named four_city.
cities <- c("Norfolk", "Virginia Beach", "Chesapeake", "Portsmouth")
four_city <- covid %>%
  filter(Locality %in% cities) 

dim(four_city)
## [1] 3824    7
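
Because %in% silently ignores names that do not match, it can be worth checking that every requested city actually appears in the filtered data. The sketch below uses a small made-up tibble (toy_covid is hypothetical and only stands in for covid):

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the real `covid` data
toy_covid <- tibble(
  Locality = c("Norfolk", "Chesapeake", "Richmond City", "Norfolk"),
  Deaths   = c(1, 0, 2, 3)
)

cities <- c("Norfolk", "Virginia Beach", "Chesapeake", "Portsmouth")

toy_four_city <- toy_covid %>%
  filter(Locality %in% cities)

# Cities that were requested but are absent from the filtered data
missing_cities <- setdiff(cities, unique(toy_four_city$Locality))
missing_cities
```

An empty result would mean all four names matched; here two names are reported as missing because the toy data does not contain them.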
  1. Plotting the total number of deaths by day for four cities in the same graph.

The combined total deaths for the four cities are shown below:

totaldeath_byday <- four_city %>%
  # Parse the month/day/year strings into Date objects
  mutate(`Report Date` = mdy(`Report Date`)) %>%
  group_by(`Report Date`) %>%
  # Sum deaths across the four cities for each report date
  summarise(total_deaths = as.integer(sum(Deaths))) %>%
  ggplot() +
  geom_line(mapping = aes(x = `Report Date`, y = total_deaths)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2)) +
  labs(title = "Reported total deaths by day for four cities")

totaldeath_byday
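
The mutate() step relies on lubridate's mdy() to turn month/day/year text into Date values, which is what lets ggplot2 draw a proper time axis. A minimal sketch with made-up date strings, assuming the column is stored as text in that order:

```r
library(lubridate)

# Example strings in month/day/year form (made up for illustration)
raw_dates <- c("03/17/2020", "12/01/2021")

parsed <- mdy(raw_dates)
class(parsed)   # "Date"
format(parsed)  # ISO-style year-month-day strings
```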

We can further divide the dataset by city. Below are the total deaths by day for each of the four cities:

death_byday23 <- four_city %>%
  # Parse the month/day/year strings into Date objects
  mutate(`Report Date` = mdy(`Report Date`)) %>%
  group_by(Locality, `Report Date`) %>%
  # One row per city and report date; drop grouping afterwards
  summarise(total_deaths = as.integer(sum(Deaths)), .groups = "drop") %>%
  ggplot() +
  geom_line(mapping = aes(x = `Report Date`, y = total_deaths, color = Locality)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2)) +
  labs(title = "Reported total deaths by day for each city")
  
death_byday23
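
If the Deaths column holds running totals rather than daily counts (an assumption worth checking against the data source), new deaths per day can be derived per city with lag(). The sketch below uses made-up numbers; toy and new_deaths are hypothetical names:

```r
library(dplyr)
library(tibble)

# Made-up cumulative death counts for two cities over three days
toy <- tibble(
  Locality      = rep(c("Norfolk", "Chesapeake"), each = 3),
  `Report Date` = rep(as.Date("2020-04-01") + 0:2, times = 2),
  Deaths        = c(1, 1, 3, 0, 2, 2)
)

daily <- toy %>%
  group_by(Locality) %>%
  arrange(`Report Date`, .by_group = TRUE) %>%
  # Difference from the previous day within each city; first day is NA
  mutate(new_deaths = Deaths - lag(Deaths)) %>%
  ungroup()

daily
```

Grouping before lag() matters: without it, the first row of one city would be differenced against the last row of another.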

  1. Plotting the proportion of total number of deaths over the hospitalization for four cities in the same graph.

The combined total deaths and hospitalizations of the four cities are shown below:
totaldeath_byday <- four_city %>%
  # Parse the month/day/year strings into Date objects
  mutate(`Report Date` = mdy(`Report Date`)) %>%
  group_by(`Report Date`) %>%
  summarise(total_deaths = as.integer(sum(Deaths)),
            total_hospit = as.integer(sum(Hospitalizations))) %>%
  # Day-over-day change in hospitalizations (computed but not plotted)
  mutate(lagHospital = total_hospit - lag(total_hospit)) %>%
  ggplot(mapping = aes(x = `Report Date`)) +
  geom_col(mapping = aes(y = total_deaths)) +
  geom_line(mapping = aes(y = total_hospit), color = "red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2)) +
  labs(title = "Reported total deaths (bars) and hospitalizations (line)", y = "N cases")

totaldeath_byday
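
The plot above overlays the two totals. To show the proportion itself, one option is to divide deaths by hospitalizations for each date and plot that ratio. This is a sketch on made-up totals, not the real data; toy_totals and death_ratio are hypothetical names:

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Made-up daily totals standing in for the summarised four-city data
toy_totals <- tibble(
  `Report Date` = as.Date("2020-04-01") + 0:3,
  total_deaths  = c(2, 4, 5, 8),
  total_hospit  = c(10, 16, 25, 32)
)

ratio_by_day <- toy_totals %>%
  # Guard against division by zero on days with no hospitalizations
  mutate(death_ratio = if_else(total_hospit > 0,
                               total_deaths / total_hospit,
                               NA_real_))

ratio_plot <- ggplot(ratio_by_day, aes(x = `Report Date`, y = death_ratio)) +
  geom_line() +
  labs(title = "Deaths per hospitalization (toy data)",
       y = "Deaths / hospitalizations")
```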

Community Policing Data

We will do some analysis on a data collection consisting of all traffic and investigatory stops made in Virginia, as aggregated by the Virginia Department of State Police. You can download the data set from https://data.virginia.gov/Public-Safety/Community-Policing-Data-July-1-2020-to-May-31-2022/2c96-texw using this R code:

police <- read_csv(file = "https://data.virginia.gov/api/views/2c96-texw/rows.csv?accessType=DOWNLOAD")

The dimensions of the data set:

dim(police)
## [1] 1882854      20

Each variable (feature) of the data set is described below:

| Column Name | Description |
| --- | --- |
| INCIDENT DATE | Indicates the date of the motor vehicle stop |
| AGENCY NAME | Name of law enforcement agency |
| JURISDICTION | Location of stop by city or county |
| REASON FOR STOP | Indicates the initial reason for the motor vehicle/traffic stop |
| PERSON TYPE | Indicates whether the person subject to the investigative detention stop is a Driver, Passenger, or Pedestrian/Individual |
| RACE | Indicates the race of the Driver/Individual involved |
| ETHNICITY | Indicates the ethnicity of the Driver/Individual involved |
| AGE | Indicates the age of the Driver/Individual involved |
| GENDER | Indicates the gender of the Driver/Individual involved |
| ENGLISH SPEAKING | Indicates if the Driver/Individual speaks English |
| ACTION TAKEN | Indicates the most serious action taken towards the Driver/Individual at the completion of the stop or as a result of the stop |
| VIOLATION TYPE | Indicates if the violation was a local or commonwealth code (no longer collected as of July 1, 2021) |
| SPECIFIC VIOLATION | Indicates the specific code section in connection with action taken |
| VIRGINIA CRIME CODE | Indicates corresponding Virginia Crime Code (Optional) |
| PERSON SEARCHED? | Indicates if the Driver/Individual was searched as a result of the stop |
| VEHICLE SEARCHED? | Indicates if the vehicle was searched as a result of the stop |
| ADDITIONAL ARREST? | Indicates a person OTHER THAN THE DRIVER was arrested as a result of the stop (no longer collected as of July 1, 2021) |
| PHYSICAL FORCE BY OFFICER | Indicates if the law-enforcement officer or State Police officer used physical force against the person |
| PHYSICAL FORCE BY SUBJECT | Indicates if the subject used physical force against any officers |
| RESIDENCY | Indicates the residency of the subject stopped |

2. Using ggplot2 to make a bar chart of total stops for the 20 jurisdictions with the most stops, mapping ACTION TAKEN to the fill color of the bars.

Finding the top 20 counties/cities in VA with the most stops:

total_20_stops <- police %>%
  # Count stops per jurisdiction, largest first, and keep the top 20
  count(JURISDICTION, sort = TRUE) %>%
  head(20)
cat("The top 20 JURISDICTION based on # of stops over VA are: \n")
## The top 20 JURISDICTION based on # of stops over VA are:
total_20_stops

We then filter the data set to only include the top 20 jurisdictions:

# Keep only the top 20 jurisdictions and count stops by action taken
to_graph_stops <- police %>%
  filter(JURISDICTION %in% total_20_stops$JURISDICTION) %>%
  count(JURISDICTION, `ACTION TAKEN`) %>%
  ggplot(aes(x = JURISDICTION, y = n, fill = `ACTION TAKEN`)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2)) +
  labs(title = "Top 20 jurisdictions by number of stops, colored by action taken")

to_graph_stops
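
Since jurisdictions differ widely in total stops, a proportional version of the same chart can make the mix of actions easier to compare. One way to sketch this is with position = "fill", which rescales each bar to a total of 1. The counts and labels below are made up for illustration:

```r
library(tibble)
library(ggplot2)

# Hypothetical counts; jurisdiction and action labels are made up
toy_counts <- tribble(
  ~JURISDICTION, ~`ACTION TAKEN`, ~n,
  "County A",    "CITATION",      80,
  "County A",    "WARNING",       20,
  "County B",    "CITATION",      10,
  "County B",    "WARNING",       30
)

share_plot <- ggplot(toy_counts,
                     aes(x = JURISDICTION, y = n, fill = `ACTION TAKEN`)) +
  # Each bar is rescaled so its segments sum to 1
  geom_col(position = "fill") +
  labs(y = "Share of stops")
```

The trade-off is that the absolute number of stops is no longer visible, so the two charts complement each other.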

  1. Using ggplot2 to make a bar chart in decreasing order of the number of stops for each reason.

The percentage of the initial reasons for stops:

init_reasons <- police %>% 
  # sort = TRUE already orders the counts from largest to smallest
  count(`REASON FOR STOP`, sort = TRUE) %>% 
  # freq is the percentage of all stops
  mutate(freq = (n / sum(n)) * 100)

init_reasons

The number of stops for each reason:

init_reasons %>% ggplot() +
  # reorder() by -n puts the most frequent reason first on the x axis
  geom_col(mapping = aes(x = reorder(`REASON FOR STOP`, -n), y = n)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2)) +
  labs(title = "Number of stops for each reason", x = "Reasons for Stop", y = "Number of Stops")
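
The decreasing order comes from reorder(), which sorts factor levels by a score in ascending order; negating the count therefore places the largest count first. A small base R illustration with made-up values:

```r
# Toy counts (made up) for three stop reasons
reason <- c("a", "b", "c")
n      <- c(5, 20, 10)

# Levels are sorted by ascending -n, i.e. descending n
ordered_reason <- reorder(reason, -n)
levels(ordered_reason)
```

Here "b" has the highest count, so it becomes the first level and would be drawn leftmost on the x axis.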

  1. Using ggplot2 to plot the number of stops by date across Virginia.
stops_by_date <- police %>% 
  # Parse the date-time strings and keep only the calendar date
  mutate(`INCIDENT DATE` = as.Date(mdy_hms(`INCIDENT DATE`))) %>%
  count(`INCIDENT DATE`) %>%
  ggplot() + 
  geom_col(mapping = aes(x = `INCIDENT DATE`, y = n))
  

stops_by_date
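
Daily counts can be noisy, so one possible variation is to aggregate by week using lubridate's floor_date(). This is a sketch on made-up timestamps; toy_stops, stop_date, and week are hypothetical names, and the week boundary depends on lubridate's week-start setting:

```r
library(dplyr)
library(tibble)
library(lubridate)

# Made-up timestamps in month/day/year hour:minute:second form
toy_stops <- tibble(
  `INCIDENT DATE` = c("07/01/2020 08:30:00",
                      "07/02/2020 14:05:00",
                      "07/09/2020 23:10:00")
)

weekly <- toy_stops %>%
  mutate(stop_date = as.Date(mdy_hms(`INCIDENT DATE`)),
         # Round each date down to the start of its week
         week = floor_date(stop_date, unit = "week")) %>%
  count(week)

weekly
```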

  1. Using ggplot2 to make a histogram of stops by age.
ggplot(data = police) +
  geom_bar(mapping = aes(x=AGE))

Some of the data points have been set to AGE = 0. Apart from those, the distribution of stops by age is right-skewed: most of the stops involve individuals between 20 and 30 years old, with a long tail toward older ages. This may be because most drivers fall within this age group.

stops <- police[police$AGE == 0,]
cat("There are", nrow(stops), "stops with AGE recorded as 0 in the data set")
## There are 38944 stops with AGE recorded as 0 in the data set
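
One way to keep the zero ages from distorting the chart is to treat them as missing before plotting, and to bin ages with geom_histogram() rather than drawing one bar per age. The sketch below assumes AGE is numeric and uses made-up values; toy_police and the 5-year bin width are illustrative choices:

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Made-up ages, including two zero placeholders
toy_police <- tibble(AGE = c(0, 19, 22, 25, 25, 31, 47, 0, 63))

ages_clean <- toy_police %>%
  # Recode 0 as NA, then drop the missing values
  mutate(AGE = na_if(AGE, 0)) %>%
  filter(!is.na(AGE))

age_hist <- ggplot(ages_clean, aes(x = AGE)) +
  # 5-year bins aligned to multiples of 5
  geom_histogram(binwidth = 5, boundary = 0) +
  labs(x = "Age", y = "Number of stops")
```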
  1. Using ggplot2 to create a pie chart of the stops by RACE, and labeling each piece of the chart by the percentage of the stops for each RACE.
stops_by_Race <- police %>%
  count(`RACE`) %>%
  mutate(freq = round((n/sum(n) * 100), digits=4)) %>%
  mutate(cs = rev(cumsum(rev(freq))),
         pos = freq/2 + lead(cs,1),
         pos = if_else(is.na(pos), freq/2, pos)) %>%
  ggplot(mapping=aes(x="", y=freq, fill=RACE)) +
  geom_col() +  
  geom_label_repel(aes(label=paste0(freq, "%"), y=pos),
                   force=.5, nudge_x = .5) +
  coord_polar(theta="y") +
  labs(title="Percentage of stops by RACE")

stops_by_Race
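
The cs and pos columns place each label at the middle of its slice. As far as I understand ggplot2's default stacking, the first group sits at the top of the stack, so its midpoint is half its own size plus the sum of all later groups. A worked example with made-up percentages (toy_share is hypothetical):

```r
library(dplyr)
library(tibble)

# Made-up percentages for three groups
toy_share <- tibble(RACE = c("A", "B", "C"), freq = c(50, 30, 20))

label_pos <- toy_share %>%
  mutate(cs  = rev(cumsum(rev(freq))),        # sum of this and all later groups
         pos = freq / 2 + lead(cs, 1),        # midpoint of this group's slice
         pos = if_else(is.na(pos), freq / 2, pos))  # last group has nothing below it

label_pos
```

Group A spans 50 to 100 and gets position 75, B spans 20 to 50 and gets 35, and C spans 0 to 20 and gets 10.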

  1. Using ggplot2 to create a pie chart of the stops by GENDER, and labeling each piece of the chart by the percentage of the stops for each GENDER.
stops_by_Gender  <- police %>%
  count(`GENDER`) %>%
  mutate(freq= round(n/sum(n)*100, digits=3)) %>% 
  mutate(cs = rev(cumsum(rev(freq))),
         pos = freq/2 + lead(cs,1),
         pos = if_else(is.na(pos), freq/2, pos)) %>%
  ggplot(mapping=aes(x="", y=freq, fill=GENDER)) +
  geom_col() +  
  geom_label_repel(aes(label=paste0(freq, "%"), y=pos),
                   force=.5, nudge_x = .5) +
  coord_polar(theta="y") +
  labs(title="Percentage of stops by Gender")

stops_by_Gender

  1. Using ggplot2 to create a stacked bar chart for the number of stops by GENDER, stacked by ACTION TAKEN.
gender_stops <- police %>%
  ggplot() +
  # geom_bar() counts rows per gender and stacks them by action taken
  geom_bar(mapping = aes(x = GENDER, fill = `ACTION TAKEN`)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.2)) +
  labs(title = "Most common action taken at the stops by GENDER")

gender_stops

Regardless of gender, the majority of actions taken are Citations/Summons. The second most common action taken is Warning.
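
A claim like this can be checked numerically by computing the share of each action within each gender. The sketch below uses made-up rows and labels (toy_actions and share are hypothetical names), so the numbers are illustrative only:

```r
library(dplyr)
library(tibble)

# Made-up stops; gender and action labels are placeholders
toy_actions <- tibble(
  GENDER         = c("MALE", "MALE", "MALE", "FEMALE", "FEMALE"),
  `ACTION TAKEN` = c("CITATION", "WARNING", "CITATION", "CITATION", "WARNING")
)

action_shares <- toy_actions %>%
  count(GENDER, `ACTION TAKEN`) %>%
  group_by(GENDER) %>%
  # Share of each action among that gender's stops
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  arrange(GENDER, desc(share))

action_shares
```

An action is a majority for a given gender only if its share exceeds 0.5 in that group.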
