
Parking Violations in NYC FY23

Description

The following dataset was derived from: https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2023/pvqr-7yc4. A full fiscal year typically has more than 15-20 million citation rows. The FY23 data set contains only about 3 million rows so far because the fiscal year only began on July 1, 2022.

Load Libraries

#if(!require("sf")) install.packages("sf")
#if(!require("leaflet")) install.packages("leaflet")
#if(!require("tigris")) install.packages("tigris")
#if(!require("httr")) install.packages("httr")

library(tidyverse)   # loads dplyr, ggplot2, stringr, purrr, etc.
library(lubridate)
library(ggrepel)
library(sf)
library(osmdata)
library(tidygeocoder)
library(httr)
library(rgdal)
library(broom)
library(rlang)

Import Dataset

The following data set was downloaded on October 8, 2022. Any later updates are not reflected in the data set.

parking_violations_FY23 <- read.csv("Parking_Violations_Issued_-_Fiscal_Year_2023.csv")

Descriptions

The data set has 3,133,085 rows and 43 columns/features.

The feature descriptions are:

Source Column Name : Description/Comment
SUMMONS-NUMBER : UNIQUE IDENTIFIER OF SUMMONS
PLATE ID : REGISTERED PLATE ID
REGISTRATION STATE : STATE OF PLATE REGISTRATION
PLATE TYPE : TYPE OF PLATE
ISSUE-DATE : ISSUE DATE
VIOLATION-CODE : TYPE OF VIOLATION
SUMM-VEH-BODY : VEHICLE BODY TYPE WRITTEN ON SUMMONS (SEDAN, ETC.)
SUMM-VEH-MAKE : MAKE OF CAR WRITTEN ON SUMMONS
ISSUING-AGENCY : ISSUING AGENCY CODE
STREET-CODE1 : GEOGRAPHICAL STREET CODE
STREET-CODE2 : GEOGRAPHICAL STREET CODE
STREET-CODE3 : GEOGRAPHICAL STREET CODE
VEHICLE EXPIRATION DATE : VEHICLE EXPIRATION DATE ON SUMMONS
VIOLATION LOCATION : GENERAL VIOLATION LOCATION
VIOLATION PRECINCT : PRECINCT OF VIOLATION
ISSUER PRECINCT : PRECINCT OF ISSUANCE
ISSUER CODE : UNIQUE CODE IDENTIFYING ISSUING OFFICER
ISSUER COMMAND : COMMAND OF ISSUING OFFICER
ISSUER SQUAD : ISSUING OFFICER'S SQUAD
VIOLATION TIME : TIME VIOLATION OCCURRED
TIME FIRST OBSERVED : TIME OF INITIAL VIOLATION OBSERVATION
VIOLATION COUNTY : COUNTY OF VIOLATION
FRONT-OF-OPPOSITE : VIOLATION IN FRONT OF OR OPPOSITE
HOUSE NUMBER : ADDRESS NUMBER OF VIOLATION
STREET NAME : STREET NAME OF SUMMONS ISSUED
INTERSECTING STREET : VIOLATION NEAR INTERSECTING STREET
DATE FIRST OBSERVED : DATE OF INITIAL VIOLATION OBSERVATION
LAW SECTION : SECTION OF VEHICLE & TRAFFIC LAW
SUB DIVISION : SUB DIVISION ON SUMMONS IMAGE
VIOLATION LEGAL CODE : TYPE OF LEGAL CODE
DAYS IN EFFECT : DAYS PARKING IN EFFECT
FROM HOURS IN EFFECT : START TIME POSTED FOR METER AND SIGN VIOLATIONS
TO HOURS IN EFFECT : END TIME POSTED FOR METER AND SIGN VIOLATIONS
VEHICLE COLOR : CAR COLOR WRITTEN ON SUMMONS
UNREGISTERED VEHICLE? : UNREGISTERED VEHICLE INDICATOR
VEHICLE YEAR : YEAR OF VEHICLE WRITTEN ON SUMMONS
METER NUMBER : PARKING METER NUMBER FOR METER VIOLATIONS
FEET FROM : NUMBER OF FEET FROM CURB FOR HYDRANT VIOLATION
VIOLATION POST CODE : LOCATION POST CODE OF VIOLATION
VIOLATION DESCRIPTION : DESCRIPTION OF VIOLATION
NO STANDING OR STOPPING VIOLATION : SUMMONS INDICATOR FOR NO STANDING OR STOPPING
HYDRANT VIOLATION : SUMMONS INDICATOR FOR HYDRANT VIOLATION
DOUBLE PARKING VIOLATION : SUMMONS INDICATOR FOR DOUBLE PARKING VIOLATION

Parking Violations based on Date-Times

Preprocessing

Before we begin the analysis, it is important to preprocess the data set so that it reflects fiscal year 2023. New York City's fiscal year begins on July 1 of each year and ends on June 30 of the following year; hence FY2023 began on July 1, 2022 and will end on June 30, 2023. However, the data set contains parking violation issue dates misrepresented as other years, so it is important to remove those misrepresented years first.

First we convert Issue.Date to a date format and create a new column for the year.

temp_parking_violations_FY23 <- parking_violations_FY23 %>%
  mutate(`Issue.Date` = mdy(`Issue.Date`)) %>%
  mutate(`Issue.year` = as.factor(year(`Issue.Date`))) %>%
  select(`Issue.year`)

As shown below, there are issue dates that fall outside fiscal year 2023.

table(temp_parking_violations_FY23)
## temp_parking_violations_FY23
##    2000    2012    2014    2017    2018    2019    2020    2021    2022    2023 
##       9       3       1       1       1       2      38     347 3132474     157 
##    2024    2025    2026    2027    2028    2029    2031    2049    2066 
##      12       6       1      24       2       3       2       1       1

The next step is to filter the data set to the window from 2022-06-01 to the date the data set was downloaded (2022-10-08).

new_parking_violations_FY23 <- parking_violations_FY23 %>%
  mutate(`Issue.Date` = mdy(`Issue.Date`),
         `Issue.weekday` = wday(`Issue.Date`, label=TRUE)) %>%
  filter(`Issue.Date` >= mdy("06/01/2022") & `Issue.Date` <= mdy("10/08/2022"))

cat("The new dimension of the data set: \n");dim(new_parking_violations_FY23)
## The new dimension of the data set:
## [1] 3131197      44

Furthermore, we convert the feature Violation.Time to a date-time. The feature is populated with values such as 1037A, 1205P, 1052A; we can infer that A represents AM and P represents PM.
We can use lubridate's parse_date_time with the format string "IMp" to convert the timestamp. However, that format expects AM/PM rather than A/P, so we concatenate an M to the end of each string.

#Convert the time using parse_date_time
new_parking_violations_FY23 <- new_parking_violations_FY23 %>%
  mutate(Violation.Time = str_c(Violation.Time, "M"),
         Violation.Time = parse_date_time(Violation.Time, "IMp", tz="EST"),
         Violation.Time.Hour = hour(Violation.Time))

#str(new_parking_violations_FY23$Violation.Time.Hour)

Parking violations vs dates Analysis

We create a bar plot of the parking violations issued by date, as shown below:

violations_by_date <- new_parking_violations_FY23 %>% 
  count(`Issue.Date`, sort=TRUE) %>%
  ggplot() + 
  geom_col(mapping = aes(x=`Issue.Date`, y = n)) +
  labs(title = "Number of violations per day") 
violations_by_date

We would expect violations to be higher in the summer, but the number of violations issued remains consistent across the 4 months of data. This may be because most summer tourists do not drive into NYC. Likewise, data for October may be missing if each month's data is only uploaded at the start of the following month; we may not see October's data until the start of November.

Parking Violations vs weekdays

Next we analyze the difference by weekday:

violations_by_weekday <- new_parking_violations_FY23 %>% 
  count(`Issue.weekday`) %>%
  ggplot() +
  geom_col(mapping=aes(x=`Issue.weekday`, y=n))
violations_by_weekday

During the weekdays (M-F), the number of violations is higher than on weekends. This may be due to the work week, when more people are on the move.

Parking by Hours (EST)

violations_by_hour <- new_parking_violations_FY23 %>% 
  count(`Violation.Time.Hour`) %>% 
  ggplot() +
  geom_line(mapping = aes(x=`Violation.Time.Hour`, y=n))
violations_by_hour

The majority of the violations occur between 9 AM and 3 PM, which may correspond to work hours.


Parking Violations based on Vehicle data

Parking Violations by Registration State

violations_by_registration_state <- new_parking_violations_FY23 %>% 
  count(`Registration.State`, sort=TRUE) %>%
  head(10) %>%
  ggplot() + 
  geom_col(mapping = aes(x=reorder(`Registration.State`, -n), y = n)) +
  labs(title = "Parking violations by registration state")  +
  theme(axis.text.x = element_text(angle=90, vjust=.2)) 
violations_by_registration_state

Not surprisingly, a majority of the violations involve New York license plates. However, there are plates registered outside New York, especially Florida. We can only assume these cars belong to visitors.

Parking Violations by Make

violations_by_make <- new_parking_violations_FY23 %>% 
  count(`Vehicle.Make`, sort=TRUE) %>%
  head(5) %>%
  arrange(`Vehicle.Make`)  %>%
  mutate(freq = round((n/sum(n) * 100), digits=4)) %>%
  mutate(cs = rev(cumsum(rev(freq))),
         pos = freq/2 + lead(cs,1),
         pos = if_else(is.na(pos), freq/2, pos))

graph_violations_by_make <- violations_by_make %>%
  ggplot(mapping=aes(x="", y=freq, fill=`Vehicle.Make`)) +
  geom_col() +  
  geom_label_repel(aes(label=paste0(freq, "%"), y=pos),
                   force=.5, nudge_x = .5, nudge_y = .5) +
  coord_polar(theta="y") +
  labs(title="Percentage of violations by the top 5 makes")
graph_violations_by_make

Toyota and Honda are the makes most frequently cited.

Parking Violations by Manufactured Years of top 5 Makes

We filter out some vehicles misrepresented as manufactured beyond 2024. Note that car manufacturers release next year's models around the summer; for example, a 2023 Nissan Altima may come out in the summer of 2022. We discretize the values into the following ranges: pre-1990, 1991-2000, 2001-2010, 2011-2020, 2021+.

#Discretize into ranges 
to_graph_violations <- new_parking_violations_FY23 %>%
  filter(`Vehicle.Make` %in%  violations_by_make$`Vehicle.Make`,
         `Vehicle.Year` < 2024 & `Vehicle.Year` != 0)  %>%
  mutate(`Discretized.Year` = case_when(
           between(`Vehicle.Year`, 1970, 1990) ~ "pre-1990",
           between(`Vehicle.Year`, 1991, 2000) ~ "1991-2000",
           between(`Vehicle.Year`, 2001, 2010) ~ "2001-2010",
           between(`Vehicle.Year`, 2011, 2020) ~ "2011-2020",
           between(`Vehicle.Year`, 2021, 2023) ~ "post-2021",
           TRUE ~ NA_character_)) %>%
  count(`Vehicle.Make`, `Discretized.Year`, sort= TRUE) %>%
  ggplot(aes(fill=`Discretized.Year`)) +
  geom_col(aes(x=`Vehicle.Make`, y = n)) +
  theme(axis.text.x = element_text(angle=90, vjust=.2)) + 
  labs(title="Top 5 vehicle makes by vehicle year")

to_graph_violations

For each of the top 5 makes, a majority of the violations come from cars manufactured between 2011 and 2020.

Parking Violations by Plate

most_issued_car <- new_parking_violations_FY23 %>%
  count(`Plate.ID`, sort=TRUE) %>%
  head(10)
most_issued_car

Most individuals accumulate zero to a few violations in a lifetime. However, these top license plates already have hundreds of citations/violations in a span of just 4 months! Repeat offenders...


Parking violations based on Violation Code and Issuer Agency

There are over 100 violation codes listed. We examine the top 10 violations:

top_10_violations <- new_parking_violations_FY23 %>% 
  count(`Violation.Code`, sort=TRUE) %>%
  head(10)
top_10_violations

Top 10 violations and their respective names and fines according to NYC:

Violation.Code  Violation Description           Violation Fine Amount ($)
5               BUS LANE VIOLATION              50
7               FAILURE TO STOP AT RED LIGHT    50
14              NO STANDING-DAY/TIME LIMITS     115
20              NO PARKING-DAY/TIME LIMITS      60-65
21              NO PARKING-STREET CLEANING      45-65
36              PHTO SCHOOL ZN SPEED VIOLATION  50
38              FAIL TO DSPLY MUNI METER RECPT  35-65
40              FIRE HYDRANT                    115
70              REG. STICKER-EXPIRED/MISSING    65
71              INSP. STICKER-EXPIRED/MISSING   65
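
If we want these names attached to the counts for later plots, one option is a small lookup table transcribed from the fine table above. A minimal sketch (violation_lookup is illustrative and not part of the original analysis):

#Illustrative lookup transcribed from the fine table above
violation_lookup <- tribble(
  ~Violation.Code, ~Violation.Description,
   5, "BUS LANE VIOLATION",
   7, "FAILURE TO STOP AT RED LIGHT",
  14, "NO STANDING-DAY/TIME LIMITS",
  20, "NO PARKING-DAY/TIME LIMITS",
  21, "NO PARKING-STREET CLEANING",
  36, "PHTO SCHOOL ZN SPEED VIOLATION",
  38, "FAIL TO DSPLY MUNI METER RECPT",
  40, "FIRE HYDRANT",
  70, "REG. STICKER-EXPIRED/MISSING",
  71, "INSP. STICKER-EXPIRED/MISSING")

#Attach a readable description to each of the top 10 codes
top_10_violations %>%
  left_join(violation_lookup, by = "Violation.Code")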

We filter to these top 10 violations and find the top 5 issuing agencies.

top_5_agency <- new_parking_violations_FY23 %>% 
  filter(Violation.Code  %in%  top_10_violations$Violation.Code) %>%
  count(`Issuing.Agency`, sort=TRUE) %>%
  head(5)
top_5_agency

The top 5 issuing agency codes and their respective names:

  • K - PARKS DEPARTMENT
  • P - POLICE DEPARTMENT
  • S - DEPARTMENT OF SANITATION
  • T - TRAFFIC
  • V - DEPARTMENT OF TRANSPORTATION
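
For plot labels, these codes can be expanded with a named vector spliced into dplyr::recode. A quick sketch (agency_names is illustrative, not part of the original analysis):

#Illustrative named vector mapping the agency codes listed above
agency_names <- c(K = "PARKS DEPARTMENT",
                  P = "POLICE DEPARTMENT",
                  S = "DEPARTMENT OF SANITATION",
                  T = "TRAFFIC",
                  V = "DEPARTMENT OF TRANSPORTATION")

#Splice the named vector into recode to attach full names
top_5_agency %>%
  mutate(Agency.Name = recode(Issuing.Agency, !!!agency_names))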

Then, we create a stacked bar chart of the top 10 violations and the top 5 issuing agencies.

to_graph <- new_parking_violations_FY23 %>%
  filter(Violation.Code  %in%  top_10_violations$Violation.Code) %>%
  filter(Issuing.Agency  %in%  top_5_agency$Issuing.Agency) %>%
  mutate(Violation.Code = as.factor(Violation.Code)) %>%
  count(Issuing.Agency, Violation.Code) %>%
  ggplot(aes(fill=`Violation.Code`)) +
  geom_col(aes(x=Issuing.Agency, y = n)) +
  theme(axis.text.x = element_text(angle=90, vjust=.2)) + 
  labs(title="Top 10 violations and the top 5 agencies that issue them")
to_graph

The Department of Transportation issues the most violations, a majority of which are vehicles caught by cameras speeding in a school zone!


Parking Violations based on Location Data

Data pre-processing

The goal of this section is to create a map of the parking violations in NYC. This is a major undertaking, however, because the data set lacks latitude and longitude features. The violations have locations, but they are split across three features: House.Number, Street.Name, and Intersecting.Street.

An example output of the three features:

geocode_NYC <- new_parking_violations_FY23 %>%
  select(House.Number, Street.Name, Intersecting.Street) %>%
  head(10)
geocode_NYC

As shown, if the violation occurred in front of a house or building, it is possible to combine House.Number and Street.Name to find the general location. However, if the violation occurred at an intersection, House.Number will most likely be empty, leaving only the intersecting streets (Street.Name and Intersecting.Street), which are difficult to convert to a general latitude and longitude.

For simplicity (and ease), we only consider the locations with a given House.Number, Street.Name, and Violation.County. We use Violation.County because it narrows the search to a particular county when we use the geocoder function later.

Selecting just House.Number, Street.Name, and Violation.County:

geocode_NYC_house.street <- new_parking_violations_FY23 %>%
  select(House.Number, Street.Name, Violation.County) 
head(geocode_NYC_house.street,4)

As seen above, some values in House.Number, Street.Name, or Violation.County are empty strings (""), not NA.

We need to remove the empty cells ("") from these features:

geocode_NYC_house.street <- geocode_NYC_house.street[geocode_NYC_house.street$House.Number != "",]
geocode_NYC_house.street <- geocode_NYC_house.street[geocode_NYC_house.street$Street.Name != "", ] 
geocode_NYC_house.street <- na.omit(geocode_NYC_house.street)
print("As shown below, the new data set:")
## [1] "As shown below, the new data set:"
head(geocode_NYC_house.street,5)

The dataset was cleaned for missing values, reducing our dimension to:

dim(geocode_NYC_house.street)
## [1] 1673269       3

In the step above, Violation.County was not cleaned for missing or misrepresented values; it requires a different approach. The unique values of Violation.County are shown below:

unique(geocode_NYC_house.street$Violation.County)
##  [1] "NY"    "BX"    "K"     ""      "Q"     "R"     "MS"    "Kings" "Qns"  
## [10] "Bronx" "Rich"  "QNS"

Among the different values that Violation.County can take, these are the ones we need in order to geocode the addresses:

K and Kings = Brooklyn

BX and Bronx = Bronx

NY = Manhattan (New York County)

Q and QNS = Queens

R and Rich = Staten Island

We filter and recode the different unique values of Violation County into their respective true names:

#Recode the county values into two data sets for later use

#geocode_NYC_house.street_Manhattan keeps only Manhattan, to map Manhattan later
geocode_NYC_house.street_Manhattan <- within(geocode_NYC_house.street,
                      {
                      County <- NA 
                      County[Violation.County == "NY"] <- "NY"
                      }) %>%
  na.omit()

#geocode_NYC_house.street to map the 5 counties with map overlay
geocode_NYC_house.street <- within(geocode_NYC_house.street,
                      {
                      County <- NA 
                      County[Violation.County == "NY"] <- "NY"
                      County[Violation.County == "BX" | Violation.County == "Bronx"] <- "Bronx"
                      County[Violation.County == "K" | Violation.County == "Kings"] <- "Brooklyn"
                      County[Violation.County == "Rich" | Violation.County == "R"] <- "Staten Island"
                      County[Violation.County == "QNS" | Violation.County == "Q"] <- "Queens"
                      }) %>%
  na.omit()

After this filtering, the new dimension is:

dim(geocode_NYC_house.street)
## [1] 1652703       4

We can confirm that we only have the 5 counties of interest:

unique(geocode_NYC_house.street$County)
## [1] "NY"            "Bronx"         "Brooklyn"      "Queens"       
## [5] "Staten Island"

The next step is to combine House.Number, Street.Name, and County into a single column to be used for geocoding. Likewise, duplicated "raw" Full_Address values are removed.

full_geocode_address <- geocode_NYC_house.street %>%
  mutate(Street = str_c(House.Number, Street.Name, sep=" "),
         County_State = str_c(County, "NY" , sep=","),
         Full_Address = str_c(Street, County_State, sep = ",")) %>%
  select(Full_Address) %>%
  distinct(Full_Address)

Unfortunately, we cannot reconcile differently abbreviated values; for example, Washington Square may appear in the data set as both Washington SQ and Washington Square. Otherwise, taking the distinct "raw" strings has reduced the size of our data set.
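
One way to partially mitigate this would be to normalize common suffix abbreviations before calling distinct. A rough sketch (the replacement rules are illustrative and are not applied in this analysis):

#Illustrative normalization of common street-suffix abbreviations.
#Even with word boundaries, rules like ST -> STREET can misfire
#(e.g. "ST MARKS PL"), so a real pass would need more care.
normalize_address <- function(address) {
  address %>%
    str_replace_all("\\bSQ\\b", "SQUARE") %>%
    str_replace_all("\\bAVE\\b", "AVENUE") %>%
    str_replace_all("\\bST\\b", "STREET")
}

normalize_address("WASHINGTON SQ,NY,NY")
#returns "WASHINGTON SQUARE,NY,NY"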

The concatenated data set now looks like:

head(full_geocode_address,5)

The new dimension:

dim(full_geocode_address)
## [1] 367777      1

Geocoding

Geocoding is the process of converting street addresses into physical longitude and latitude coordinates. In this step, we process the addresses created in the steps above.

Important Note: Everything in this section would be much easier to implement if the library ggmap did not rely on the Google API for queries. Unfortunately, requesting geocodes through ggmap requires a registered Google API key, which requires billing ($$).

The library tidygeocoder provides a geocoding function that queries the United States Census Bureau (https://geocoding.geo.census.gov). Unlike the Google API, the Census Bureau is limited to batches of 10,000 addresses, but it requires no API key!

Hence, to batch:

# The batch limit is 10,000 for the Census Bureau
batchsize <- 10000
groups <- ceiling(nrow(full_geocode_address)/batchsize)

#Split the addresses into `groups` roughly equal batches
fils <- split(full_geocode_address, factor(sort(rank(row.names(full_geocode_address))%%groups)))
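
A more direct way to build batches of at most 10,000 rows is to index by row position; a sketch of the same idea:

#Assign each row to a batch of at most batchsize rows by position;
#this yields the same list-of-data-frames structure as the split above
batch_index <- ceiling(seq_len(nrow(full_geocode_address)) / batchsize)
fils <- split(full_geocode_address, batch_index)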

After creating the batch list needed for the query, we apply it to the geocoder. Since the geocoder only accepts tibbles, we create a list of tibbles and apply the function:

get_lat_long <- function(Full_Address) {
  #Wrap the batch in a tibble and geocode it against the Census API
  tmp <- tibble(Full_Address)
  tmp %>% 
    tidygeocoder::geocode(address = Full_Address, method = "census", verbose = FALSE)
}

We use lapply to recursively unpack the list of lists and apply the function above.

Important Note: Since this requires batched queries, it takes a long time to iterate and retrieve the data, so we set eval = FALSE on this chunk to avoid rerunning it. We save the result as a CSV and read it back in to save time. Each batch of 10,000 took 4-5 minutes on average; hence, 50 batches can take 3-4 hours per rerun!

#Retrieves longitude and latitude for every batch
newLongandLat <- lapply(fils, lapply, get_lat_long)  

# Flatten the list of tibbles to a single list and bind them:
flat_location <- bind_rows(flatten(newLongandLat))

#Not all location will return a long and lat due to data error
flat_location <- na.omit(flat_location)

# Write to csv
write.csv(flat_location, "NYC_longitude_latitude.csv", row.names = FALSE)

Read the output csv file:

flat_csv_long_lat <- read.csv("NYC_longitude_latitude.csv")

Ensure that long and lat fall within the longitude and latitude ranges of New York City:

flat_csv_long_lat <- flat_csv_long_lat %>%
  filter(`long` >= -75.0 & `long` <= -73.0,
         `lat` >= 40.4 & `lat` <=  41.0)

The dimensions of the geocoded data:

dim(flat_csv_long_lat)
## [1] 258010      3

After geocoding, the data set now has longitude and latitude data:

head(flat_csv_long_lat,10)

Downloading the New York City Shapefile

As mentioned earlier, the library ggmap would make many of these steps simpler if it did not require a registered Google API key (which requires paying for the service). One alternative is to download the New York City shapefile:

Note: since the AWS access key expires, query the following address in a browser and copy the resulting signed URL into the code. The hard-coded query string will change with each rerun.

# If rerunning, query the following address in the browser and copy the resulting address into map <- rgdal::readOGR()
# ('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')

map <- rgdal::readOGR("https://data-beta-nyc-files.s3.amazonaws.com/resources/35dd04fb-81b3-479b-a074-a27a37888ce7/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson?Signature=VxZ1OST93MNZN1e4V7se6haki2A%3D&Expires=1666721574&AWSAccessKeyId=AKIAWM5UKMRH2KITC3QA")
## OGR data source with driver: GeoJSON 
## Source: "https://data-beta-nyc-files.s3.amazonaws.com/resources/35dd04fb-81b3-479b-a074-a27a37888ce7/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson?Signature=VxZ1OST93MNZN1e4V7se6haki2A%3D&Expires=1666721574&AWSAccessKeyId=AKIAWM5UKMRH2KITC3QA", layer: "OGRGeoJSON"
## with 310 features
## It has 4 fields
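
As an aside, the same GeoJSON can also be read with sf (loaded earlier), avoiding rgdal entirely. A sketch, assuming the data.beta.nyc address from the comment above (or a freshly signed S3 URL) still resolves:

#Alternative reader using sf; pass the same (refreshed) GeoJSON address
nyc_neighborhoods_sf <- sf::st_read("http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson")

#sf objects plot directly with geom_sf, with no broom::tidy step
ggplot() +
  geom_sf(data = nyc_neighborhoods_sf) +
  theme_void()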

Final Map

nyc_neighborhoods_df <- tidy(map)
ggplot() + 
  geom_polygon(data=nyc_neighborhoods_df, aes(x=long, y=lat, group=group)) +
  geom_point(data=flat_csv_long_lat, aes(x=long, y = lat), alpha = 1/20, size=.0001, color = "red") +
  theme_void()

To the right, Queens County has very sparse data points. This can be attributed to the fact that Queens County is made up of smaller 'neighborhoods'. For example, [147-20 Cherry Ave, Flushing, NY] is in Queens County but in the Flushing neighborhood. Unlike the other four counties, we cannot standardize Queens addresses as [Full Address, Queens, NY] because the geocoder returns NA for lat and long. For example, [147-20 Cherry Ave, Queens, NY] returns NA, even though it is in Queens County.

Mapping Manhattan using Point w/o map overlay

In this section, we focus on mapping Manhattan County only without the map overlay.

Filter the geocoded addresses down to Manhattan (addresses ending in "NY,NY"):

Manhattan_geocode <- flat_csv_long_lat[grep("NY,NY", flat_csv_long_lat$Full_Address), ]

Ensure that long and lat fall within the longitude and latitude ranges of New York City:

Manhattan_geocode <- Manhattan_geocode %>%
  filter(`long` >= -75.0 & `long` <= -73.0,
         `lat` >= 40.4 & `lat` <=  41.0)

The dimensions of the Manhattan data:

dim(Manhattan_geocode)
## [1] 64146     3

The Manhattan subset now has longitude and latitude data:

head(Manhattan_geocode,10)
# For centering the ggplot
min_long <- min(Manhattan_geocode$long) + .06
max_long <- max(Manhattan_geocode$long)

min_lat <- min(Manhattan_geocode$lat) + .05
max_lat <- max(Manhattan_geocode$lat)

# Plot the following points
ggplot() + 
  geom_point(data=Manhattan_geocode, aes(x=long, y = lat), alpha = .8, size=.03, shape = 16, color = "orange")+
  scale_y_continuous(limits=c(min_lat,max_lat)) +
  scale_x_continuous(limits=c(min_long,max_long)) +  
  annotate("text", x= -74.05, y = 40.85, label = "Manhattan Parking Citations \n2022/06 - 2022/10", col = "white", family = "mono", fontface="bold", hjust=0) +
   annotate("text", x= max_long-.03, y = min_lat, label = "Using 64,000+ data points", col = "orange", family = "mono", fontface="bold", hjust=0, size=2.5) +
  theme_void()+
  theme(panel.background=element_rect(fill='black'))

#ggsave("manhattan.png")

We can almost map out all of the streets in Manhattan!

Mapping Brooklyn using Point w/o map overlay:

brooklyn_geocode <- flat_csv_long_lat[grep("Brooklyn,NY", flat_csv_long_lat$Full_Address), ]
dim(brooklyn_geocode)
## [1] 133018      3
#For centering ggplot
min_long <- min(brooklyn_geocode$long) 
max_long <- max(brooklyn_geocode$long) 

min_lat <- min(brooklyn_geocode$lat) 
max_lat <- max(brooklyn_geocode$lat) - .10

#plot the given points
ggplot() + 
  geom_point(data=brooklyn_geocode, aes(x=long, y = lat), alpha = .8, size=.03, shape = 16, color = "orange")+
  scale_y_continuous(limits=c(min_lat,max_lat))   + 
  annotate("text", x= min_long , y = max_lat, label = "Brooklyn Parking Citations \n2022/06 - 2022/10", col = "white", family = "mono", fontface="bold", hjust=0) +
  annotate("text", x= max_long-.05, y = min_lat, label = "Using 130,000+ data points", col = "orange", family = "mono", fontface="bold", hjust=0, size=2.5) +
  theme_void() +
  theme(panel.background=element_rect(fill='black'))

#ggsave("Brooklyn.png")

Tidygeocoder

Since the batch querying was done only once to reduce rerun time, here is an example of tidygeocoder's output on a small data set:

examples_address <- tibble(line_address= c("1700 CROTONA AVE,Bronx,NY", "82 UNION ST,Brooklyn,NY", "99 EARHART LN,Bronx,NY"))
examples_address <- examples_address %>% 
    tidygeocoder::geocode(address = line_address, method = "census", verbose = TRUE)
## 
## Number of Unique Addresses: 3
## Executing batch geocoding...
## Batch limit: 10,000
## Passing 3 addresses to the US Census batch geocoder
## Querying API URL: https://geocoding.geo.census.gov/geocoder/locations/addressbatch
## Passing the following parameters to the API:
## format : "json"
## benchmark : "Public_AR_Current"
## vintage : "Current_Current"
## Query completed in: 1.2 seconds
examples_address

Further Improvement:

As discussed earlier, geocoding addresses is computationally expensive, and it would be far more expensive if we scaled the project up to billions of points. Suppose we wanted to aggregate all parking tickets issued after 2010; the amount of data to process would be costly. However, the map would be more visually appealing, as we could map out Manhattan with even more density!

There are other implementations that connect each ticket to the "centerline" of the given street and then compute the centroid of that street segment. The computation is more efficient but sacrifices exactness.
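
As a rough sketch of that idea, assume a hypothetical centerlines layer with one LINESTRING per street segment (not part of this analysis); each segment reduces to a single representative point that tickets could be matched to by street name:

#Hypothetical two-segment centerline layer, for illustration only
centerlines <- st_sf(
  Street.Name = c("BROADWAY", "5 AVE"),
  geometry = st_sfc(
    st_linestring(rbind(c(-73.99, 40.75), c(-73.98, 40.76))),
    st_linestring(rbind(c(-73.98, 40.74), c(-73.98, 40.75))),
    crs = 4326))

#Reduce each street segment to its centroid; tickets would then be
#joined on Street.Name instead of geocoding every full address
street_centroids <- centerlines %>%
  mutate(geometry = st_centroid(geometry))
street_centroids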

Likewise, we avoided using the intersection of two street names. Tidygeocoder is capable of geocoding intersections; however, after several experiments, batch encoding with method = "census" did not work as expected with this data set (it may work better with others). Only about 5-10% of intersections returned a lat and long through batch encoding. Using method = "arcgis" provided superior results, returning a lat and long, but it does not support batch encoding, so it would be computationally expensive even though it would let us build a more complete visualized map.
