INTRODUCTION: This dataset was found online at https://data.gov.au/dataset/ds-qld-918fdafc-5cc5-493d-a6d4-1aa128bdd075/details?q=Crime
Summary of what the data set is:
This data relates to offenses in police districts. In Queensland, offender statistics from the police are based on the number of offenses cleared, not on individual offenders. The data counts the number of offenses solved through action against an offender. Only individuals ten years and older are considered offenders. Juvenile offenders are aged between ten and sixteen years. Offenders without specified age or gender in police records are not included. There are 15 counties included in this data set.
Focus of this project:
I wanted to specifcally study offeders who commited drug crimes. Out of nearly 100 attributes I kept around 10 that were relevant to examining people charged with drug crimes in Queensland AU. This information spans from JAN of 2001 to FEB of 2024.
The categories I ended up keeping were 'District', 'Month Year', 'Age', 'Sex', 'Drug Offences', 'Trafficking Drugs', 'Possess Drugs', 'Produce Drugs', 'Sell Supply Drugs', 'Other Drug Offences' . This gave all the quanitative drug information as well as all the categorical data of each offender.
Import the Proper Libraries to do this project
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
/var/folders/gc/0752xrm56pnf0r0dsrn5370c0000gr/T/ipykernel_24214/2660915427.py:2: DeprecationWarning: Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0), (to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries) but was not found to be installed on your system. If this would cause problems for you, please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466 import pandas as pd
PREPROCESSING: This dataset needed to modified to include just drug crimes and categorical information. Missinging values needed to checked for as well in the data. Some category outputs needed to simplified for the sake of simplicity. The following below explains the preprocessing steps that were taken.
The following is the steps I took to get the original data set to work as a CSV. Renanme original .csv to a .txt file, do not open as a dataframe and then convert, fix errors in the file so it can write properly to .csv and then processed. This is taken care of in the final submission.
# Read the contents of the original file, convert from .csv to .txt before executing this line
with open('district_Reported_Offenders_Number.txt', 'r') as file:
lines = file.readlines()
# Insert numbers at the beginning of each line starting and remove last value from the second line onwards of every line of the .txt file
# This fixes an error in the data set where the final quantitative value is set in line with a linear number line
for i in range(1, len(lines)):
lines[i] = f"{i},{','.join(lines[i].split(',')[:-1])}\n"
# Write the modified content back to the file, the file is done and needs to be saved before being converted to .csv
with open('modified_file.txt', 'w') as file:
file.writelines(lines)
Convert .txt to .csv file so it can then be turned into a dataframe, takes advantage of .csv settings
# Read the .txt file into a DataFrame
df = pd.read_csv('modified_file.txt')
# Write the DataFrame to a .csv file
df.to_csv('district_crimes.csv', index=False)
Read in the .csv file for your data. Test out displaying it and get the size of the dataset
crimes = pd.read_csv("district_crimes.csv")
print(crimes.shape)
cM, cN = crimes.shape
crimes.head()
(25020, 93)
District | Month Year | Age | Sex | Homicide (Murder) | Other Homicide | Attempted Murder | Conspiracy to Murder | Manslaughter (excl. by driving) | Manslaughter Unlawful Striking Causing Death | ... | Fare Evasion | Public Nuisance | Stock Related Offences | Traffic and Related Offences | Dangerous Operation of a Vehicle | Drink Driving | Disqualified Driving | Interfere with Mechanism of Motor Vehicle | Miscellaneous Offences | Other Offences | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Capricornia | JAN01 | Adult | Female | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 16 | 0 | 11 | 0 | 9 | 2 | 0 | 0 | 70 |
1 | Capricornia | JAN01 | Adult | Male | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 70 | 0 | 126 | 6 | 98 | 21 | 1 | 1 | 415 |
2 | Capricornia | JAN01 | Adult | Not Stated | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | Capricornia | JAN01 | Juvenile | Female | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 |
4 | Capricornia | JAN01 | Juvenile | Male | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 41 |
5 rows × 93 columns
Create the drug_crimes data frame that isolates the categorical data of .csv with the quanitative dru offences data
# this only for drug crime data
columns_to_keep1 = ['District', 'Month Year', 'Age', 'Sex', 'Drug Offences', 'Trafficking Drugs', 'Possess Drugs', 'Produce Drugs', 'Sell Supply Drugs', 'Other Drug Offences']
# Create a new DataFrame 'drug_crimes' with only the specified columns
drug_crimes = crimes[columns_to_keep1].copy()
# Filter the DataFrame to include only rows where at least one of the specified columns has a value greater than 0
drug_crimes= drug_crimes[(drug_crimes['Drug Offences'] > 0)]
# drug offences is just a total of all other drug offences, if it 0 then the individual has commited no crimes related to drugs
#print(drug_crimes.columns)
print(drug_crimes.shape)
# get the bounds of the new dataset
dcN, dcM = drug_crimes.shape
drug_crimes.head()
(16531, 10)
District | Month Year | Age | Sex | Drug Offences | Trafficking Drugs | Possess Drugs | Produce Drugs | Sell Supply Drugs | Other Drug Offences | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Capricornia | JAN01 | Adult | Female | 32 | 0 | 15 | 2 | 1 | 14 |
1 | Capricornia | JAN01 | Adult | Male | 102 | 0 | 51 | 10 | 2 | 39 |
3 | Capricornia | JAN01 | Juvenile | Female | 3 | 0 | 1 | 0 | 0 | 2 |
4 | Capricornia | JAN01 | Juvenile | Male | 9 | 0 | 5 | 1 | 0 | 3 |
6 | Capricornia | FEB01 | Adult | Female | 28 | 0 | 8 | 2 | 1 | 17 |
More Preprocessing analysis
Display the type of each column in the data frame
for columns in drug_crimes:
print(columns + ": "+ str(drug_crimes[columns].dtype))
District: object Month Year: object Age: object Sex: object Drug Offences: int64 Trafficking Drugs: int64 Possess Drugs: int64 Produce Drugs: int64 Sell Supply Drugs: int64 Other Drug Offences: int64
Detect Missing Values and Negative Values in the Data
# Check for missing values in each column
missing_values = drug_crimes.isnull().any()
print("Is there any missing values in all columns: ")
print(missing_values)
print("")
#print(drug_crimes.iloc[:, 4:10])
#check for negative values in any of the columns
negative_values = (drug_crimes.iloc[:, 4:10] < 0).any()
print("Is there any negative values in the quanitative columns: ")
print(negative_values)
print("")
Is there any missing values in all columns: District False Month Year False Age False Sex False Drug Offences False Trafficking Drugs False Possess Drugs False Produce Drugs False Sell Supply Drugs False Other Drug Offences False dtype: bool Is there any negative values in the quanitative columns: Drug Offences False Trafficking Drugs False Possess Drugs False Produce Drugs False Sell Supply Drugs False Other Drug Offences False dtype: bool
Count the total amount of unique categorical outputs for each column that has a categorical type
unique_counts = [] # store each categorical type
# run a for loop through all of the categorical data
for column in drug_crimes[['District', 'Month Year', 'Age', 'Sex']]:
unique_values = drug_crimes[column].nunique()
unique_counts.append(unique_values)
# print out all of the categorical data
print("Number of unique values for each column:")
for i, column in enumerate(drug_crimes[['District', 'Month Year', 'Age', 'Sex']].columns):
print(f"{column}: {unique_counts[i]}")
Number of unique values for each column: District: 15 Month Year: 278 Age: 2 Sex: 3
Explanation:
Month Year: goes from January of 2001 (JAN01) to Febuary (FEB24), that is 23 full years 23*12=276, add 1 month and the fact that JAN01 is counted as 1 and not 0 gets 278 total unique outputs.
Distircts: There is 15 seperate districts counted in this dataset for the area of Queensland Australia
Age: The ages are specificed by Adult or Juvenile
Sex: This is specificed by Male, Female, or Not Stated. It makes sense that there is 3
See how much of the data the category 'Sex' has as 'Not Stated' in it, and see if we should remove it
count_not_stated = (drug_crimes.iloc[:, 3] == "Not Stated").sum()
print("Number of times 'Not Stated' appears in the column:", count_not_stated)
Number of times 'Not Stated' appears in the column: 282
Update the data frame to exclude entries where the Sex of the criminal is "Not Stated"
drug_crimes = drug_crimes[drug_crimes["Sex"] != "Not Stated"]
fN, fM = drug_crimes.shape
difference = dcN - fN
#print(difference) difference is 282
Lets simplify the data from the Category Month Year to just Year, this will change for example JAN01 and FEB01 to just 01. This spans 23 years without including the month.
# Assuming 'Month Year' is a string column in the 'drug_crimes' DataFrame
drug_crimes['Year'] = drug_crimes['Month Year'].str[-2:]
# Drop the 'Month Year' column if you no longer need it
drug_crimes.drop('Month Year', axis=1, inplace=True)
See final categorical outputs for the data
unique_counts = [] # store each categorical type
# run a for loop through all of the categorical data
for column in drug_crimes[['District', 'Year', 'Age', 'Sex']]:
unique_values = drug_crimes[column].nunique()
unique_counts.append(unique_values)
# print out all of the categorical data
print("Number of unique values for each column:")
for i, column in enumerate(drug_crimes[['District', 'Year', 'Age', 'Sex']].columns):
print(f"{column}: {unique_counts[i]}")
drug_crimes.head()
Number of unique values for each column: District: 15 Year: 24 Age: 2 Sex: 2
District | Age | Sex | Drug Offences | Trafficking Drugs | Possess Drugs | Produce Drugs | Sell Supply Drugs | Other Drug Offences | Year | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Capricornia | Adult | Female | 32 | 0 | 15 | 2 | 1 | 14 | 01 |
1 | Capricornia | Adult | Male | 102 | 0 | 51 | 10 | 2 | 39 | 01 |
3 | Capricornia | Juvenile | Female | 3 | 0 | 1 | 0 | 0 | 2 | 01 |
4 | Capricornia | Juvenile | Male | 9 | 0 | 5 | 1 | 0 | 3 | 01 |
6 | Capricornia | Adult | Female | 28 | 0 | 8 | 2 | 1 | 17 | 01 |
Perform Outlier analysis on all of the quanitative data
# Select quantitative columns
quantitative_columns = drug_crimes[['Drug Offences', 'Trafficking Drugs', 'Possess Drugs', 'Produce Drugs', 'Sell Supply Drugs', 'Other Drug Offences']]
# Calculate z-scores for each data point in the selected columns
z_scores = quantitative_columns.apply(zscore)
# Define a threshold for identifying outliers (e.g., z-score > 3 or < -3)
outliers = (z_scores > 3) | (z_scores < -3)
# Print the count of outliers for each column
print("Out of " + str(len(drug_crimes)) + " total recordings,")
print("The number of outliers in each quantitative column is listed below:")
print(outliers.sum())
--------------------------------------------------------------------------- NameError Traceback (most recent call last) Cell In[13], line 5 2 quantitative_columns = drug_crimes[['Drug Offences', 'Trafficking Drugs', 'Possess Drugs', 'Produce Drugs', 'Sell Supply Drugs', 'Other Drug Offences']] 4 # Calculate z-scores for each data point in the selected columns ----> 5 z_scores = quantitative_columns.apply(zscore) 7 # Define a threshold for identifying outliers (e.g., z-score > 3 or < -3) 8 outliers = (z_scores > 3) | (z_scores < -3) NameError: name 'zscore' is not defined
Summary Data Analysis
Summary data analysis Make a statistical summary of every column using numerical and graphical components. Explore correlations between at least 3 pairs of quantitative columns.
summary = drug_crimes.describe()
print(summary)
Drug Offences Trafficking Drugs Possess Drugs Produce Drugs \ count 16249.000000 16249.000000 16249.000000 16249.000000 mean 75.867007 0.500892 35.193243 2.302849 std 102.185848 1.357915 49.305527 3.848647 min 1.000000 0.000000 0.000000 0.000000 25% 10.000000 0.000000 5.000000 0.000000 50% 32.000000 0.000000 15.000000 1.000000 75% 103.000000 0.000000 46.000000 3.000000 max 886.000000 24.000000 457.000000 33.000000 Sell Supply Drugs Other Drug Offences count 16249.000000 16249.000000 mean 2.376700 35.493323 std 4.825648 46.833235 min 0.000000 0.000000 25% 0.000000 5.000000 50% 1.000000 15.000000 75% 3.000000 50.000000 max 185.000000 401.000000
More Data Analysis, the 'Drug Offences' category is a total of the other 5 drug offences categories and should not be included as correlation comparison of the data, We are finding correlations between the quanitative data using a correlation matrix
# Select columns for correlation analysis
cols = ['Trafficking Drugs', 'Possess Drugs', 'Produce Drugs', 'Sell Supply Drugs', 'Other Drug Offences']
corr = drug_crimes[cols].corr()
# Create a heatmap
sns.heatmap(corr, annot=True, cmap='plasma')
plt.title('Correlation Matrix')
plt.show()
From this plot we can see the only crimess that are highly correlated are drugs crimes between Possession and Other Drug Offences. It is probably likely to assume that people that are charged with Possession in AU are probably charged with less categorically relevant drug crimes.
Additionally there is a moderately positive relationship between:
- Possession and Sell Supply of Drugs
- Possession and Production of Drugs
- Other Drug Offences and the Production of Drugs
- Other Drug Offences and the Sell Supply of Drugs
I would conclude with the fact that there is no other signiifcant correlations between the other types of drug offences in the data
The 5 Correlations I deemed signficant by being greater than 0.6 are graphed below as scatter plots comparing the correlations to each catergory visually. They are in the order occourence in the previous markdown cell
# Create a figure with 5 subplots
fig, axs = plt.subplots(3, 2, figsize=(20, 15))
# Function to calculate and display correlation coefficient
def annotate_correlation(x, y, ax):
correlation = np.corrcoef(x, y)[0, 1]
ax.annotate(f'Correlation: {correlation:.2f}', xy=(0.5, 0.9), xycoords='axes fraction', ha='center')
# Scatter plot 1: Other Drug Offences vs Possess Drugs
ax = axs[0, 0]
ax.scatter(drug_crimes['Other Drug Offences'], drug_crimes['Possess Drugs'])
ax.set_title('Other Drug Offences vs Possess Drugs')
ax.set_xlabel('Other Drug Offences')
ax.set_ylabel('Possess Drugs')
annotate_correlation(drug_crimes['Other Drug Offences'], drug_crimes['Possess Drugs'], ax)
# Scatter plot 2: Possess Drugs vs Sell Supply Drugs
ax = axs[0, 1]
ax.scatter(drug_crimes['Possess Drugs'], drug_crimes['Sell Supply Drugs'])
ax.set_title('Possess Drugs vs Sell Supply Drugs')
ax.set_xlabel('Possess Drugs')
ax.set_ylabel('Sell Supply Drugs')
annotate_correlation(drug_crimes['Possess Drugs'], drug_crimes['Sell Supply Drugs'], ax)
# Scatter plot 3: Possess Drugs vs Produce Drugs
ax = axs[1, 0]
ax.scatter(drug_crimes['Possess Drugs'], drug_crimes['Produce Drugs'])
ax.set_title('Possess Drugs vs Produce Drugs')
ax.set_xlabel('Possess Drugs')
ax.set_ylabel('Produce Drugs')
annotate_correlation(drug_crimes['Possess Drugs'], drug_crimes['Produce Drugs'], ax)
# Scatter plot 4: Other Drug Offences vs Produce Drugs
ax = axs[1, 1]
ax.scatter(drug_crimes['Other Drug Offences'], drug_crimes['Produce Drugs'])
ax.set_title('Other Drug Offences vs Produce Drugs')
ax.set_xlabel('Other Drug Offences')
ax.set_ylabel('Produce Drugs')
annotate_correlation(drug_crimes['Other Drug Offences'], drug_crimes['Produce Drugs'], ax)
# Scatter plot 5: Other Drug Offences vs Sell Supply Drugs
ax = axs[2, 0]
ax.scatter(drug_crimes['Other Drug Offences'], drug_crimes['Sell Supply Drugs'])
ax.set_title('Other Drug Offences vs Sell Supply Drugs')
ax.set_xlabel('Other Drug Offences')
ax.set_ylabel('Sell Supply Drugs')
annotate_correlation(drug_crimes['Other Drug Offences'], drug_crimes['Sell Supply Drugs'], ax)
# Adjust layout
plt.tight_layout()
# Show the plots
plt.show()
Bar Graphs comparing the difference in the total amount of drug offences commited by Adults vs. Juvniles in Queensland AU over the last 23 years.
# Specify the quantitative columns, excluded the 'Drug Offences' category because it is just a total of the other 5
quantitative_columns = ['Trafficking Drugs', 'Possess Drugs', 'Produce Drugs', 'Sell Supply Drugs', 'Other Drug Offences']
# Group by 'Age' and sum the total number of offences for each quantitative column
# A lambda function is used to tell if an offence was commited, not how many total offence each indiviudal has commmited
offences_by_age = drug_crimes.groupby('Age')[quantitative_columns].apply(lambda x: (x != 0).sum())
# Plotting the bar graph
ax = offences_by_age.plot(kind='bar')
plt.title('Distribution of Total Number of Offences by Age and Offence Type (From JAN01-FEB24)')
plt.xlabel('Age')
plt.ylabel('Total Number of Offences')
plt.xticks(rotation=0)
# Move the legend to the upper right corner
plt.legend(title='Offence Type', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.show()
Bar Graphs comparing the difference in the total amount of drug offences commited by Males vs. Females in Queensland AU over the last 23 years.
# Assuming 'drug_crimes' DataFrame is already loaded with relevant data
# Specify the quantitative columns
quantitative_columns = [ 'Trafficking Drugs',
'Possess Drugs', 'Produce Drugs',
'Sell Supply Drugs', 'Other Drug Offences'
]
# Group by 'Age' and sum the total number of offences for each quantitative column
offences_by_age = drug_crimes.groupby('Sex')[quantitative_columns].apply(lambda x: (x != 0).sum())
# Plotting the bar graph
offences_by_age.plot(kind='bar')
plt.title('Distribution of Total Number of Offences by Sex and Offence Type (From JAN01-FEB24)')
plt.xlabel('Sex')
plt.ylabel('Total Number of Offences')
plt.xticks(rotation=0)
# Move the legend to the upper right corner
plt.legend(title='Offence Type', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.show()
Bar Chart Dividing up between Males and Female, in order to directly compare by Age how many more Adult vs. Juvenile drug offences there are for each deug offence type.
import matplotlib.pyplot as plt
# Define the quantitative columns
quantitative_columns = ['Trafficking Drugs', 'Possess Drugs', 'Produce Drugs', 'Sell Supply Drugs']
# Group by 'Age' and 'Sex', and count the number of non-zero entries for each quantitative column
offences_by_age_sex = drug_crimes.groupby(['Sex', 'Age'])[quantitative_columns].apply(lambda x: (x != 0).sum())
# Plot the bar chart for 'Age' and 'Sex'
offences_by_age_sex.unstack().plot(kind='bar', figsize=(12, 8), width=0.8)
plt.title('Number of Drug Offences Divided by Sex and Compared Between Age Groups (From JAN01-FEB24)')
plt.xlabel('Age Group')
plt.ylabel('Number of Drug Offences (From JAN01-FEB24)')
plt.xlabel('Sex')
plt.xticks(rotation=0)
# Move the legend to the upper right corner
plt.legend(title='Offence Type', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.show()
Discussion:
Make a Quantitative and Categorical Prediction
Can we use the following categories 'Trafficking Drugs', 'Possess Drugs', 'Produce Drugs', 'Sell Supply Drugs' to predict the total amount of 'Drug Offences' an individual has commited?
Can we use the 6 quantitative drug crimes columns to classify with 0.9 F1 score whether or not a drug offender's 'Sex' is Male or Female?