6 Florida Crime Analytics

6.1 Introduction

In this report, we analyze county-level crime data in Florida to explore the relationship between crime rates and important socioeconomic factors. Specifically, we look at how crime relates to median income, high school graduation rates, and urban population, and determine which of these variables best predicts county-level crime rates across the state.

6.2 Required Packages

First, we load the packages used in this report.

library(tidyverse)
library(readxl)
library(dplyr)
library(stringr)
library(knitr)
library(ggplot2)
library(ggpubr)
library(ggcorrplot)
library(AICcmodavg)
library(broom)

6.3 Loading and Preparing the Data

florida_crime <- read_xlsx("Florida County Crime Rates.xlsx")

florida_crime <- florida_crime %>%
  rename(Crime = C,
         Income = I,
         HighSchoolGrad = HS,
         UrbanPop = U) %>% 
  mutate(County = str_to_title(tolower(County)))

summary(florida_crime)

    County              Crime           Income      HighSchoolGrad 
 Length:67          Min.   :  0.0   Min.   :15.40   Min.   :54.50  
 Class :character   1st Qu.: 35.5   1st Qu.:21.05   1st Qu.:62.45  
 Mode  :character   Median : 52.0   Median :24.60   Median :69.00  
                    Mean   : 52.4   Mean   :24.51   Mean   :69.49  
                    3rd Qu.: 69.0   3rd Qu.:28.15   3rd Qu.:76.90  
                    Max.   :128.0   Max.   :35.60   Max.   :84.90  
    UrbanPop    
 Min.   : 0.00  
 1st Qu.:21.60  
 Median :44.60  
 Mean   :49.56  
 3rd Qu.:83.55  
 Max.   :99.60

First, we loaded the dataset from Excel, renamed the columns to standardize them (Crime, Income, HighSchoolGrad, UrbanPop), and reformatted the county names so only the first letter is capitalized for consistency. Then, we used summary() to get a general overview of the dataset.

6.4 Exploratory Data Analysis

6.4.1 Basic descriptive statistics (mean, median, range)

descriptive_stats <- data.frame(
  Statistic = c("Mean", "Median", "Range"),
  Crime = c(mean(florida_crime$Crime),
            median(florida_crime$Crime),
            diff(range(florida_crime$Crime))),
  Income = c(mean(florida_crime$Income),
             median(florida_crime$Income),
             diff(range(florida_crime$Income))),
  HighSchoolGrad = c(mean(florida_crime$HighSchoolGrad),
                     median(florida_crime$HighSchoolGrad),
                     diff(range(florida_crime$HighSchoolGrad))),
  UrbanPop = c(mean(florida_crime$UrbanPop),
               median(florida_crime$UrbanPop),
               diff(range(florida_crime$UrbanPop))))

knitr::kable(descriptive_stats)

Table 6.1: Basic descriptive statistics (mean, median, and range) for crime rate, median income, high school graduation rate, and urban population percentage.

Statistic	Crime	Income	HighSchoolGrad	UrbanPop
Mean	52.40299	24.51045	69.48955	49.55821
Median	52.00000	24.60000	69.00000	44.60000
Range	128.00000	20.20000	30.40000	99.60000

Then, we calculated basic descriptive statistics (mean, median, and range) for all numeric variables in the dataset — Crime, Income, HighSchoolGrad, and UrbanPop and organized the results into a table for easier comparison.

6.4.2 Scatterplot: Median income (in thousands) + Crime rate (per 1,000 residents)

ggplot(florida_crime, aes(x = Income, y = Crime)) +
  geom_point (size = 3) + 
  geom_smooth(method = "lm", se = FALSE) +
  stat_cor(method = "pearson") +
  labs(title = "Relationship Between Crime rate and Median income",
       x = "Median income (in thousands)", 
       y = "Crime rate (per 1,000 residents)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(size = 17, family = "Georgia", face = "bold"),
        axis.title.x = element_text(size = 12, family = "Georgia"),
        axis.title.y = element_text(size = 12, family = "Georgia"))

Figure 6.1: Scatterplot showing the relationship between median household income (in thousands of dollars) and crime rate (per 1,000 residents). Each point represents a county, with the fitted regression line illustrating the overall trend between income level and crime rate.

6.4.2.1 Interpretation:

The scatter plot shows a moderate positive relationship between median income and crime rate (r ≈ 0.43). As income increases, crime rates also tend to rise slightly, although there is still some variation.

6.4.3 Scatterplot: Urban population (%) + Crime rate (per 1,000 residents)

ggplot(florida_crime, aes(x = UrbanPop, y = Crime)) +
  geom_point (size = 3) + 
  geom_smooth(method = "lm", se = FALSE) +
  stat_cor(method = "pearson") +
  labs(title = "Relationship Between Crime rate and Urban population",
       x = "Urban population (%)", 
       y = "Crime rate (per 1,000 residents)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(size = 17, family = "serif", face = "bold"),
        axis.title.x = element_text(size = 12, family = "serif"),
        axis.title.y = element_text(size = 12, family = "serif"))

Figure 6.2: Scatterplot showing the relationship between urban population percentage and crime rate (per 1,000 residents), with a fitted linear trend line and Pearson correlation coefficient.

6.4.3.1 Interpretation:

The scatter plot shows a fairly strong positive relationship between urban population percentage and crime rate (r ≈ 0.68). This suggests that counties with a larger urban population tend to have higher crime rates. The trend line supports this pattern.

6.5 Correlation Analysis

6.5.1 Correlation matrix:

cor_matrix <- cor(florida_crime[,c("Crime","Income","HighSchoolGrad","UrbanPop")])
cor_matrix

                   Crime    Income HighSchoolGrad  UrbanPop
Crime          1.0000000 0.4337503      0.4669119 0.6773678
Income         0.4337503 1.0000000      0.7926215 0.7306983
HighSchoolGrad 0.4669119 0.7926215      1.0000000 0.7907190
UrbanPop       0.6773678 0.7306983      0.7907190 1.0000000

6.5.2 Visualization:

ggcorrplot(cor_matrix,
           lab = TRUE,
           title = "Correlation Matrix: Crime, Income, HighSchoolGrad, UrbanPop",
           lab_size = 3,
           colors=)

Figure 6.3: Correlation matrix showing the relationships among Crime, Income, HighSchoolGrad, and UrbanPop. Cell values represent Pearson correlation coefficients, with color indicating the strength and direction of each relationship.

6.5.2.1 Interpretation:

All variables show a positive relationship with crime. Income (r ≈ 0.43) and high school graduation rate (r ≈ 0.47) have moderate positive correlations, while urban population (r ≈ 0.68) shows a strong positive correlation and is the strongest predictor of crime out of the variables.

6.6 Building Regression Models

6.6.1 Building a simple regression model:

model_income <- lm(Crime ~ Income, data = florida_crime)
summary(model_income)


Call:
lm(formula = Crime ~ Income, data = florida_crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.452 -21.347  -3.102  17.580  69.357 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.6059    16.7863  -0.691 0.491782    
Income        2.6115     0.6729   3.881 0.000246 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.6 on 65 degrees of freedom
Multiple R-squared:  0.1881,    Adjusted R-squared:  0.1756 
F-statistic: 15.06 on 1 and 65 DF,  p-value: 0.0002456

6.6.2 Building multiple regression models:

model_income_highschoolgrad <- lm(Crime ~ Income + HighSchoolGrad, data = florida_crime)
model_income_urbanpop <- lm(Crime ~ Income + UrbanPop, data = florida_crime)
model_all <- lm(Crime ~ Income + HighSchoolGrad + UrbanPop, data = florida_crime)

summary(model_income_highschoolgrad)


Call:
lm(formula = Crime ~ Income + HighSchoolGrad, data = florida_crime)

Residuals:
   Min     1Q Median     3Q    Max 
-42.75 -19.61  -4.57  18.52  77.86 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)    -46.1094    24.9723  -1.846   0.0695 .
Income           1.0311     1.0839   0.951   0.3450  
HighSchoolGrad   1.0540     0.5729   1.840   0.0705 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.14 on 64 degrees of freedom
Multiple R-squared:  0.2289,    Adjusted R-squared:  0.2048 
F-statistic:   9.5 on 2 and 64 DF,  p-value: 0.000244

summary(model_income_urbanpop)


Call:
lm(formula = Crime ~ Income + UrbanPop, data = florida_crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.130 -15.590  -6.484  16.595  48.921 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.9723    16.3536   2.444   0.0173 *  
Income       -0.7906     0.8049  -0.982   0.3297    
UrbanPop      0.6418     0.1110   5.784 2.36e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.91 on 64 degrees of freedom
Multiple R-squared:  0.4669,    Adjusted R-squared:  0.4502 
F-statistic: 28.02 on 2 and 64 DF,  p-value: 1.815e-09

summary(model_all)


Call:
lm(formula = Crime ~ Income + HighSchoolGrad + UrbanPop, data = florida_crime)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.407 -15.080  -6.588  16.178  50.125 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     59.7147    28.5895   2.089   0.0408 *  
Income          -0.3831     0.9405  -0.407   0.6852    
HighSchoolGrad  -0.4673     0.5544  -0.843   0.4025    
UrbanPop         0.6972     0.1291   5.399 1.08e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.95 on 63 degrees of freedom
Multiple R-squared:  0.4728,    Adjusted R-squared:  0.4477 
F-statistic: 18.83 on 3 and 63 DF,  p-value: 7.823e-09

6.6.3 Comparing R², Adjusted R², AIC values:

AIC(model_income, model_income_highschoolgrad, model_income_urbanpop, model_all) %>%
  as.data.frame() %>%
  dplyr::arrange(AIC)

                            df      AIC
model_income_urbanpop        4 602.4276
model_all                    5 603.6764
model_income_highschoolgrad  4 627.1524
model_income                 3 628.6045

models <- list(model_income, model_income_highschoolgrad, model_income_urbanpop,model_all)

names(models) <- c("model_income", "model_income_highschoolgrad", "model_income_urbanpop", 
                   "model_all")

purrr::map_dfr(models, broom::glance, .id = "model") %>%
  dplyr::select(model, r.squared, adj.r.squared, p.value, AIC)

# A tibble: 4 × 5
  model                       r.squared adj.r.squared       p.value   AIC
  <chr>                           <dbl>         <dbl>         <dbl> <dbl>
1 model_income                    0.188         0.176 0.000246       629.
2 model_income_highschoolgrad     0.229         0.205 0.000244       627.
3 model_income_urbanpop           0.467         0.450 0.00000000181  602.
4 model_all                       0.473         0.448 0.00000000782  604.

6.6.3.1 Interpretation:

When comparing models, the one with Income and UrbanPop (model_income_urbanpop) has the lowest AIC (602), meaning it provides the best balance between accuracy and simplicity. It also has a relatively high adjusted R² (0.45), showing it explains a good portion of the variance in crime without unnecessary complexity. The full model (adding HighSchoolGrad) has a slightly lower adjusted R² (0.448) and a higher AIC (604), indicating that adding HighSchoolGrad does not meaningfully improve the model. Therefore, UrbanPop appears to be the most influential predictor, and the model with Income + UrbanPop is the best overall in terms of simplicity and predictive performance.

6.6.3.2 Therefore,

counties with larger urban populations tend to have higher crime rates. For every 1% increase in the urban population, the predicted crime rate increases by about 0.56 incidents per 1,000 residents, holding income constant. This relationship is positive, moderately strong (adjusted R² ≈ 0.45), and statistically significant (p < 0.001). This suggests that more urbanized areas in Florida experience higher crime, possibly due to greater population density or urban conditions.

Key Finding

The model that best predicts county-level crime rates includes both median income and urban population. Together, these variables explain a substantial portion of the variation in crime across counties (Adjusted R² ≈ 0.45) while maintaining model simplicity. Among them, urban population is the strongest individual predictor, showing the highest correlation with crime and remaining significant across models.

6.7 Communicating Findings

After looking at the Florida county crime data, we found that the model that predicts crime the best includes both median income and urban population. This model explains about 45% of the differences in crime rates across counties (Adjusted R² ≈ 0.45) and also has the lowest AIC, meaning it fits well without being too complex. Urban population turned out to be the strongest predictor, it has the highest correlation with crime (r ≈ 0.68) and stays significant in every model. Ultimately, this means that counties with more people living in urban areas tend to have higher crime rates. Because of this, it might make sense for the Florida Police Department to focus more resources and prevention in highly urban counties. One limitation is that, like any correlation analysis, it does not prove causation. Additionally, this model does not account for other important socioeconomic factors, such as unemployment or poverty, which could also influence crime rates.

6.8 Final Comment

Lastly, based on our analysis, the model that best predicts Florida’s county-level crime rates is the one using both median income and urban population (Crime ~ Income + UrbanPop). This model has the lowest AIC and is able to explain almost half of the differences in crime rates between counties. Urban population is the strongest predictor, and income slightly improves the model’s accuracy, making this combination the most effective for understanding crime differences across counties.