# Lab 2

In this lab, you'll perform some directed data analysis on two datasets. The first is a dataset of service requests from New York City, reported to a 311 citizen hotline. The second is a dataset of Washington State employee salaries.

Many code cells are one-liners, while a few might require as many as 3 or 4. If you find yourself using more lines than that, you should probably look for a simpler approach.


## Part 1: New York City 311 Data

A little setup: import pandas and define the URL we'll pull our data from:

In [None]:
import pandas as pd
complaints_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/311-service-requests.csv"

Read in the dataset of 311 (citizen hotline - no relation to our course number) requests:

In [None]:
complaints = pd.read_csv(complaints_url, low_memory=False)
complaints

### Basic Selection

1.1 Display the first 8 rows of the dataframe.

1.2 Extract and display just the "Complaint Type" column.

1.3 Combine the techniques of the above two questions to get the first 8 rows of just the complaint type column. Does it matter which order you select them in (column then rows, or rows then column)?

1.4 Extract a DataFrame containing only the "Complaint Type" and "Borough" columns.

1.5 Display a tally of how many times each complaint type appears in the dataframe.

### Which borough has the most noise complaints?

1.6 Create a new Series using == that stores True if the complaint type is equal to "Noise - Street/Sidewalk", and False otherwise. Assign it to a variable called `is_noise`.

1.7 Create a new DataFrame that contains only the noise complaints by indexing `complaints` with `is_noise`.
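
As a quick illustration of the boolean-mask pattern used in 1.6 and 1.7, here is a minimal sketch on a made-up DataFrame. The `pets` data, its column names, and the `is_cat` variable are hypothetical and have nothing to do with the 311 dataset.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "bird"],
    "city": ["Bellingham", "Seattle", "Seattle", "Bellingham"],
})

# Comparing a column to a value yields a boolean Series, one entry per row.
is_cat = pets["species"] == "cat"

# Indexing the DataFrame with that Series keeps only the rows marked True.
cats = pets[is_cat]
print(cats)
```

The filtered result keeps the original row labels, so `cats` here has index values 0 and 2 rather than being renumbered from 0.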

1.8 Display a summary of the noise complaints; one call should tell you at a glance how many complaints there were, how many unique zip codes there were, and the most common Descriptor associated with noise complaints.

1.9 Display the count of noise complaints for each borough.

So it looks like Manhattan has the most noise complaints. Not too surprising! But Manhattan might also just have more complaints overall, so let's look at the *fraction* of complaints that were noise complaints.

1.10 Calculate the total count of complaints (of all types) for each borough. Then divide the noise complaint counts by the total complaint counts to get the fraction of noise complaints. Finally, multiply the result by 100 and store in a DataFrame called `complaint_percents`.
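
Dividing one Series of counts by another relies on pandas matching entries by index label rather than by position. The sketch below shows this with made-up counts; the labels and numbers are hypothetical and are not taken from the 311 data.

```python
import pandas as pd

# Hypothetical counts keyed by label; note the two Series list labels in different orders.
subset_counts = pd.Series({"north": 2, "south": 1})
total_counts = pd.Series({"south": 4, "north": 8, "east": 5})

# Arithmetic aligns on index labels, so "north" is divided by "north", and so on.
percents = subset_counts / total_counts * 100
print(percents)
```

A label that appears in only one of the two Series (here `"east"`) comes out as `NaN`, which is worth checking for before interpreting the result.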

So yep, it looks like Manhattan is just noisy. Who knew?

## Part 2: Washington State Employee Salary Data





This section performs some directed analyses on a dataset obtained from the Washington State Fiscal Information website.
Washington state employee salary information is public under state law, and anyone can obtain a spreadsheet of about 450,000 salaries from the last five years simply by requesting the data by email.
The first code cell below loads an Excel spreadsheet containing five years of data into a Pandas DataFrame.
Additional information about the dataset can be found in the [FAQ](https://fiscal.wa.gov/staffing/SalaryDataFAQ.pdf). Agency codes you may find useful in completing the lab can be found in the [Washington State Agency Codes directory](https://ofm.wa.gov/sites/default/files/public/accounting/singleaudit/2022/24_FY22_Washington_State_Agency_Codes_By_Agency_Assigned_Number.pdf).

There are 9 columns in the dataset:
 1. agy - Integer agency code
 2. AgyTitle - Name of agency
 3. Name - Last name, first name (all caps)
 4. JobTitle - Name of job (all caps)
 5. Sal2019 - 2019 salary
 6. Sal2020 - 2020 salary
 7. Sal2021 - 2021 salary
 8. Sal2022 - 2022 salary
 9. Sal2023 - 2023 salary

### Reading the Data

Use the `read_excel` function with its default arguments to load the data into a DataFrame. The file is about 26 MB, so it may take some time to load; as usual, try to avoid re-running this cell often.

In [None]:
df = pd.read_excel('https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/AnnualEmployeeSalary.xlsx')
df.head()

First, let’s explore the data to answer questions about state employee turnover.
A zero salary for a salary year column indicates that the individual did not work for that particular state agency that year.
For the purposes of our analyses, use the following definitions:

*   A ***new employee*** in a particular year is someone who ***worked for a particular agency that year***, but did not work for the agency ***in any prior year***.
For example, Mekdes Abate in row 0 is a new employee in 2020, and Peter Abbarno in row 1 is a new employee in 2021.
Note that under this definition we cannot determine the number of new employees for 2019.

*   A ***permanent leave employee*** in a particular year is someone who worked for the agency during the previous year but did not work for the agency ***in that year or in any following year (an absence of at least two years)***.
Mekdes Abate in row 0 is a permanent leave employee in 2021.
Note that under this definition we cannot determine the number of permanent leave employees for either 2019 or 2023.
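
To make the two definitions concrete, here is a small sketch for a single year (2021) on made-up salaries. The `toy` DataFrame, its row labels, and the variable names are hypothetical, and this is only one possible way to express the definitions, not a prescribed solution.

```python
import pandas as pd

# Hypothetical salaries, one row per employee/agency pair; 0 means "did not work that year".
toy = pd.DataFrame(
    {
        "Sal2019": [0, 0, 50000],
        "Sal2020": [40000, 0, 52000],
        "Sal2021": [0, 60000, 0],
        "Sal2022": [0, 61000, 0],
        "Sal2023": [0, 62000, 55000],
    },
    index=["A", "B", "C"],
)

# Boolean frame: True wherever the person was paid in that year.
worked = toy > 0

# New in 2021: worked in 2021 and in none of the prior years (2019, 2020).
new_2021 = worked["Sal2021"] & ~worked[["Sal2019", "Sal2020"]].any(axis=1)

# Permanent leave in 2021: worked in 2020 and in none of 2021, 2022, 2023.
left_2021 = worked["Sal2020"] & ~worked[["Sal2021", "Sal2022", "Sal2023"]].any(axis=1)

print(new_2021)
print(left_2021)
```

Row `C` is the interesting case: that person is absent in 2021 and 2022 but returns in 2023, so they do not count as a permanent leave employee in 2021.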



There are two rows corresponding to Meynun Abdalla working for the House of Representatives. This is far from the only instance of an individual occupying different roles in the same agency over the course of several years (think promotions and lateral career moves). Since we are concerned with employee turnover, it doesn’t seem right to treat Meynun as a new employee for the House of Representatives in both 2022 and 2023. For the purposes of a turnover analysis we should combine all rows which share the same agency and employee by summing the respective yearly salaries for those rows. With our original data loaded into the DataFrame `df`, we can aggregate matching employee/agency rows with the following code snippet:

In [None]:
import numpy as np

# Build a combined "agency%name" key so each employee/agency pair gets one label
name_agy = np.array([agy.strip() + '%' + name.strip()
                     for agy, name in df[['AgyTitle', 'Name']].to_numpy()])
df['nagy'] = name_agy
# Sum each year's pay (2019-2023) across rows that share the same employee/agency key
agg = df.groupby('nagy')
agg = agg.aggregate({f'Sal{year}': "sum" for year in range(2019, 2024)})
# Recover the agency title from the nagy (short for name/agency) index
agg['agy'] = agg.index
agg['agy'] = agg['agy'].apply(lambda x: x.split('%')[0])

In [None]:
agg[agg['agy'] == 'House of Representatives'].head()

Notice the aggregated income data for Meynun Abdalla and Katherine Abernathy in the resulting DataFrame. Perform the following analyses using the aggregated DataFrame `agg`.
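
To see the effect of this kind of aggregation on a small scale, the sketch below sums salaries for rows that share an agency and a name. It uses a made-up `toy` DataFrame with hypothetical names, and it groups on the two columns directly rather than building a combined string key, so it is an alternative formulation and not the snippet used in this lab.

```python
import pandas as pd

# Hypothetical rows: the same person appears twice at one agency under different job titles.
toy = pd.DataFrame({
    "AgyTitle": ["House", "House", "Senate"],
    "Name": ["DOE, JANE", "DOE, JANE", "ROE, RICK"],
    "JobTitle": ["CLERK", "ANALYST", "CLERK"],
    "Sal2022": [30000, 0, 45000],
    "Sal2023": [0, 35000, 46000],
})

# Group on both keys and sum each salary column within each employee/agency pair.
combined = toy.groupby(["AgyTitle", "Name"])[["Sal2022", "Sal2023"]].sum()
print(combined)
```

After grouping, the two rows for the same person at the same agency collapse into one row that carries a salary in both years, so that person no longer looks like two separate short-term employees.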

2.1 In total for all agencies, how many new employees were there in each year 2020-2023?

2.2 What is the average number of new employees from an agency for each year 2020-2023?


2.3 In total for all agencies, how many permanent leave employees were there in each year 2020-2022?


2.4 What is the average number of permanent leave employees per agency for each year 2020-2022?


2.5 What percentage of state employees have worked for more than one agency over the years 2019-2023?


2.6 In a markdown cell, synthesize your findings from questions 2.1-2.5 to summarize trends in state employment over the years 2019-2023.

Next we will use the dataset to investigate what might be good state jobs to apply for.
**Exclude salaries less than $1000 from these analyses.**
We will look at the median salary for each agency using the AgyTitle field, i.e. the institution a person worked for.
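
One way to combine a salary threshold with a per-group median is sketched below on made-up data. The `toy` DataFrame, the agency names, and the salary values are hypothetical, and masking with `where` is just one of several reasonable approaches.

```python
import pandas as pd

# Hypothetical salaries for a single year, with an agency label per row.
toy = pd.DataFrame({
    "agy": ["Parks", "Parks", "Parks", "Ecology", "Ecology"],
    "Sal2023": [500, 40000, 60000, 0, 70000],
})

# where() replaces salaries below the threshold with NaN, which median() skips by default.
kept = toy["Sal2023"].where(toy["Sal2023"] >= 1000)

# Median per agency, sorted from highest to lowest.
medians = kept.groupby(toy["agy"]).median().sort_values(ascending=False)
print(medians)
```

From a sorted Series like `medians`, `.head(10)` and `.tail(10)` give the highest and lowest ten entries respectively.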

2.7 List the top ten paid agencies (by median salary) for each year.

2.8 List the bottom ten paid agencies (by median salary) for each year. Did this change over time?

2.9 List the ten agencies with the most employees in the dataset. Do any of these overlap with the top ten paid agencies?

2.10 In a markdown cell discuss how the results of these analyses might inform a person’s decision to apply to prospective state employers.