In my current data science project, I’ve combined GeoPy with clustering techniques to gain insight into the geospatial aspects of my dataset. GeoPy, a robust Python library, has been instrumental in geocoding large datasets, converting addresses into latitude and longitude coordinates. This geocoding step is fundamental: it lets the data be visualized on geographic plots, giving spatial context to observed patterns and trends. Building on Python’s rich ecosystem, I then applied clustering algorithms to the geocoded data, notably K-Means from the scikit-learn library, to group similar data points by their geospatial attributes. The outcomes have been enlightening.

GeoPy’s Contribution: Using GeoPy, I geocoded my datasets accurately, enabling precise plotting of data points on maps, such as a map of the United States.

Clustering Analysis: Applying both K-Means and DBSCAN, a density-based clustering method, to the geocoded data, I identified distinct clusters that revealed valuable geospatial insights.
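The clustering step can be sketched roughly as follows. This is an illustrative example, not the project’s actual code: the toy coordinates below stand in for the geocoded points GeoPy would produce, and the K-Means and DBSCAN parameters are illustrative choices.

```python
# Sketch: clustering geocoded latitude/longitude points with K-Means and
# DBSCAN. In the real workflow, GeoPy would supply these coordinates.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Toy coordinates: two loose groups of points (lat, lon).
coords = np.array([
    [40.7, -74.0], [40.8, -73.9], [40.6, -74.1],    # New York area
    [34.0, -118.2], [34.1, -118.3], [33.9, -118.1], # Los Angeles area
])

# K-Means needs the number of clusters up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
print("K-Means labels:", kmeans.labels_)

# DBSCAN instead groups points that lie within eps (here, degrees) of
# each other; eps and min_samples are tuning choices.
db = DBSCAN(eps=0.5, min_samples=2).fit(coords)
print("DBSCAN labels:", db.labels_)
```

Note the design difference: K-Means requires choosing the number of clusters in advance, while DBSCAN discovers clusters by density and labels isolated points as noise (label -1), which is often a better fit for irregular geographic data.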

Project Outcomes:

  1. Geospatial Customer Segmentation: Through clustering customer data, I successfully delineated distinct customer groups based on their geographical locations, providing vital insights into regional preferences and behaviors. This, in turn, informs targeted marketing strategies.
  2. Trend Identification: Clustering shed light on geospatial trends, highlighting areas of heightened activity or interest. Such insights are pivotal for informed decision-making, guiding resource allocation and expansion strategies.

In a recent class session, our professor introduced us to GeoPy, demonstrating its geocoding functionality by plotting data points on a map of the USA. A specific focus was California, where we explored the correlation between shootings and crime rates. Through this exercise, we delved into clustering techniques, with particular emphasis on the DBSCAN method. The discussion extended to whether shootings depend on crime rates, considering factors like crime intensity in regions with varying crime rates. This session sparked engaging discussions and intriguing questions, deepening our understanding of GeoPy and the potential applications of clustering.


Today, I was engrossed in researching Analysis of Variance (ANOVA), a powerful statistical tool for comparing the means of two or more groups within a dataset. Its principal objective is to determine whether significant differences exist among the group means. ANOVA works by comparing the variance within each group against the variance between groups. If the between-group variation notably outweighs the within-group variation, ANOVA signals that at least one group’s mean significantly deviates from the others.

In various domains such as scientific research, quality control, and social sciences, this statistical test plays a pivotal role. By offering a p-value, ANOVA empowers researchers to gauge the statistical significance of observed differences. A p-value below a predetermined threshold (often 0.05) suggests that the observed differences are unlikely to be attributed to random chance, prompting further investigation.

ANOVA manifests in different forms, notably one-way ANOVA, which contrasts multiple groups within a single factor, and two-way ANOVA, designed to assess the impact of two independent factors. The outcomes derived from ANOVA act as compasses for decision-making, enabling researchers and analysts to draw meaningful conclusions and make well-informed choices in their respective fields.
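A one-way ANOVA of the kind described above can be run in a few lines with SciPy. The three groups below are made-up sample measurements, purely for illustration:

```python
# Minimal sketch of a one-way ANOVA using scipy.stats.f_oneway.
# The three groups are invented data for illustration only.
from scipy import stats

group_a = [23, 25, 27, 22, 26]
group_b = [30, 32, 31, 29, 33]   # noticeably higher mean
group_c = [24, 26, 25, 23, 27]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# A p-value below the usual 0.05 threshold suggests at least one
# group mean differs from the others.
if p_value < 0.05:
    print("Reject the null hypothesis: group means are not all equal.")
```

Because group_b’s mean is well above the other two relative to the within-group spread, the test yields a small p-value here; a follow-up post-hoc test (e.g. Tukey’s HSD) would identify which specific groups differ.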



Project 2 involves a comprehensive examination of two distinct datasets, “fatal-police-shootings-data” and “fatal-police-shootings-agencies,” each designed to fulfill specific analytical objectives. The first dataset, “fatal-police-shootings-data,” contains 19 columns and 8,770 rows, spanning January 2, 2015, to October 7, 2023. It is important to note the presence of missing values in crucial columns such as threat type, flee status, and location details. Despite these gaps, the dataset offers a wealth of insight into fatal police shootings, covering critical aspects like threat levels, weapons involved, demographic information, and more.

On the other hand, the second dataset, “fatal-police-shootings-agencies,” is composed of six columns and 3,322 rows. As with the first dataset, there are instances of missing data, specifically within the “oricodes” column. This dataset provides details about law enforcement agencies, including identifiers, names, types, locations, and their roles in fatal police shootings.
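A first step with datasets like these is auditing the missing values column by column. The sketch below is hypothetical: in the real project the data would come from the CSV files named above (e.g. `pd.read_csv("fatal-police-shootings-data.csv")`), and the column names and values here are stand-ins for illustration.

```python
# Sketch of a missing-value audit with pandas. The tiny DataFrame below
# stands in for the real CSV; column names are assumed for illustration.
import numpy as np
import pandas as pd

shootings = pd.DataFrame({
    "threat_type": ["point", np.nan, "attack", np.nan],
    "flee_status": [np.nan, "foot", np.nan, "car"],
    "city":        ["Shelton", "Aloha", np.nan, "Wichita"],
})

# Count missing values per column to decide how to handle the gaps:
# drop rows, impute, or analyze only the non-missing subset.
missing_counts = shootings.isna().sum()
print(missing_counts)
```

Knowing which columns are incomplete, and how badly, determines which questions the data can actually answer before any analysis begins.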

To extract meaningful insights and draw informed conclusions, it’s imperative to consider the context and pose specific queries that align with the objectives of the analysis. Both datasets present a valuable opportunity to study and gain a deeper understanding of fatal police shootings, the associated law enforcement agencies, and the intricate interplay of factors involved.


In our class discussion, we delved into a disconcerting revelation presented by The Washington Post regarding fatal police shootings in the United States. Since 2015, a staggering 8,770 fatal police shootings have occurred, with a chilling 928 such incidents recorded in the past 12 months alone. What’s particularly troubling is the significant undercounting issue, with only a third of these incidents making their way into the FBI database due to reporting discrepancies and confusion among law enforcement agencies. Furthermore, the data underscores a distressing racial disparity, with Black Americans being fatally shot at more than twice the rate of White Americans, despite comprising only 14 percent of the population. Hispanic Americans are similarly disproportionately affected. This emphasizes a pressing need for systemic reforms and heightened accountability to bridge this glaring gap in police shootings. The database curated by The Washington Post, containing detailed records of these incidents, emerged as a crucial resource for our discussion, shedding light on the urgent need for change and serving as a reminder of this persistent issue in our society.

October 4

I wanted to share with you our recent progress in predicting diabetes rates based on two key independent variables—obesity rate and inactivity rate—using a multiple linear regression model.

Key Variables: In our model, obesity rate and inactivity rate are considered as independent variables (X), while diabetes rate is the dependent variable (y).

Dataset Division: We divided the dataset into training and testing sets, so that the model is evaluated on data it has not seen during training.

Model Initialization and Fitting: We initialized a linear regression model and fitted it to the training data, utilizing the independent variables (obesity rate and inactivity rate). The model aims to capture the relationship between these variables and the prevalence of diabetes.

Coefficient Analysis: Following the model fitting, we analyzed the coefficients to understand how the independent factors, namely obesity rate and inactivity rate, influence the prevalence of diabetes. These coefficients provide valuable insights into the impact of each variable on the diabetes rate.

Intercept Interpretation: The intercept represents the predicted diabetes rate when both obesity and inactivity rates are zero, serving as the model’s baseline reference point.

Model Evaluation with R² Score: To assess the model’s performance on the test set, we used the R² score, which measures the proportion of variance in the dependent variable (diabetes rate) that the independent variables (obesity rate and inactivity rate) explain. A higher R² score indicates that the model accounts for more of the variation in the data.

Our ongoing research aims to further refine this model and explore additional metrics for a comprehensive evaluation of its predictive capabilities.
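The workflow described above can be sketched end to end with scikit-learn. This is a hedged illustration, not our actual analysis: the obesity, inactivity, and diabetes values are synthetic, generated with made-up coefficients, so the fitted numbers carry no real-world meaning.

```python
# Sketch of the multiple linear regression workflow on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
obesity = rng.uniform(15, 40, n)     # independent variable 1 (%)
inactivity = rng.uniform(10, 35, n)  # independent variable 2 (%)
# Synthetic ground truth: diabetes rises with both predictors, plus noise.
diabetes = 1.5 + 0.20 * obesity + 0.15 * inactivity + rng.normal(0, 0.5, n)

X = np.column_stack([obesity, inactivity])  # independent variables
y = diabetes                                # dependent variable

# Hold out a test set so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("Coefficients (obesity, inactivity):", model.coef_)
print("Intercept:", model.intercept_)

# R^2: proportion of variance in diabetes rate explained by the model.
r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on the test set: {r2:.3f}")
```

With data generated this way, the fitted coefficients recover the true positive effects of both predictors, and the R² on the held-out set stays high because the relationship is genuinely linear; real county-level health data would be noisier.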

October 2

During today’s discussion, we delved into a critical aspect of scientific research: the distinction between “results” and “findings.” “Results” are the raw, concrete data derived from experiments or studies, the clear numbers and facts. “Findings” go beyond the data: they involve interpreting the results, drawing conclusions, making connections, and explaining the significance of the acquired data. Our instructor emphasized that both play vital roles in effectively communicating and comprehending our scientific work.

We also explored the concept of a capstone project, an exciting opportunity for students to apply their knowledge to real-world challenges. These projects demonstrate the culmination of our academic learning and let us engage in genuine problem-solving. Our instructor highlighted the key components of a capstone project: the objectives we aim to achieve and the actions we will undertake to accomplish them. Capstone projects showcase our ability to integrate and apply our knowledge in practical research and solution-oriented work.

As a fascinating example, we discussed a capstone project centered on voice signals, which aims to develop an application capable of predicting an individual’s health based on their voice patterns. The prospect of merging healthcare and technology in this project is truly exciting, presenting a potential shift in how health assessments are conducted. Projects like these exemplify our dedication to making a meaningful impact in our respective fields by leveraging technology to enhance healthcare solutions.