Dismal Ascent: 2025 Federal Accessions
By popular demand, learn who entered the federal workforce and how between January 2025 and January 2026. Use the filters to drill down into specific agencies, subcomponents, occupations, demographics, and accession (hiring) categories.
Total Accessions
Average Salary
Average Prior Service (Years)
Accessions by Agency & Leadership Status (Top 20)
Prior Service by Accession Type
Each colored box represents the interquartile range, with a vertical bar marking the median. Whiskers extend to the extreme values, excluding outliers, which are shown as dots. A missing box indicates insufficient data for statistical analysis.
Career Level vs. Accession Type
Categorizes incoming hires by prior creditable service experience.
Distribution of Prior Service by Age Bracket
Whisker box plots showing the distribution of prior creditable service, partitioned by age bracket.
Accessions by Occupational Group & Series (Top 20)
The broad occupational groups with the highest volume of accessions. Hover over the segments to see the specific job series.
Accessions by Month and Type
Select Filtered Records
Click here to see how this data was pre-processed in a Jupyter notebook
Dismal Ascent: Data Pre-Processing
This notebook loads OPM accession data from multiple JSON files, filters out old records, cleans string artifacts, casts columns into native types (Dates, Booleans, Floats) for analysis in the Observable Framework, and exports the combined dataset as optimized Parquet and JSON lookup files.
1. Define I/O
import pandas as pd
import numpy as np
import glob
json_files = glob.glob("accessions_*.json")
output_file = "opm_accessions.parquet"
print(f"Found {len(json_files)} files to process:")
for f in json_files:
    print(f" - {f}")
Found 13 files to process:
- accessions_202501_2_2026-04-12.json
- accessions_202502_3_2026-04-12.json
- accessions_202503_3_2026-04-12.json
- accessions_202504_3_2026-04-12.json
- accessions_202505_3_2026-04-12.json
- accessions_202506_3_2026-04-12.json
- accessions_202507_3_2026-04-12.json
- accessions_202508_3_2026-04-12.json
- accessions_202509_3_2026-04-12.json
- accessions_202510_3_2026-04-12.json
- accessions_202511_3_2026-04-12.json
- accessions_202512_2_2026-04-12.json
- accessions_202601_1_2026-04-12.json
2. Load JSON in Pandas
dataframes = []
for file in json_files:
    try:
        df = pd.read_json(file, lines=True)
        dataframes.append(df)
        print(f"Successfully loaded {file}")
    except Exception as e:
        print(f"Error loading {file}: {e}")
combined_df = pd.concat(dataframes, ignore_index=True)
print(f"\nTotal records loaded: {len(combined_df)}")
Successfully loaded accessions_202501_2_2026-04-12.json
Successfully loaded accessions_202502_3_2026-04-12.json
Successfully loaded accessions_202503_3_2026-04-12.json
Successfully loaded accessions_202504_3_2026-04-12.json
Successfully loaded accessions_202505_3_2026-04-12.json
Successfully loaded accessions_202506_3_2026-04-12.json
Successfully loaded accessions_202507_3_2026-04-12.json
Successfully loaded accessions_202508_3_2026-04-12.json
Successfully loaded accessions_202509_3_2026-04-12.json
Successfully loaded accessions_202510_3_2026-04-12.json
Successfully loaded accessions_202511_3_2026-04-12.json
Successfully loaded accessions_202512_2_2026-04-12.json
Successfully loaded accessions_202601_1_2026-04-12.json
Total records loaded: 137468
3. Remove Old Records
The OPM download files included records with personnel actions prior to January 2025. We're only interested in actions that took place between January 2025 and January 2026.
if "personnel_action_effective_date_yyyymm" in combined_df.columns:
    pre_2025_mask = combined_df["personnel_action_effective_date_yyyymm"].astype(float).fillna(0).astype(int) < 202501
    invalid_count = pre_2025_mask.sum()
    print(f"Found {invalid_count} records from before January 2025. Removing them...")
    # Keep only valid records
    combined_df = combined_df[combined_df["personnel_action_effective_date_yyyymm"].astype(float) >= 202501].copy()
    print(f"Remaining records: {len(combined_df)}")
Found 2211 records from before January 2025. Removing them...
Remaining records: 135257
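The cutoff logic can be sanity-checked on a toy frame with made-up YYYYMM values (the column name matches the OPM field; the data is invented):

```python
import pandas as pd

# Two records before January 2025, two in range
toy = pd.DataFrame(
    {"personnel_action_effective_date_yyyymm": ["202411", "202412", "202501", "202601"]}
)

# Same cast-then-compare pattern as above
mask = toy["personnel_action_effective_date_yyyymm"].astype(float).astype(int) < 202501
kept = toy[~mask].copy()

print(len(kept))  # 2
```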
4. Clean String Artifacts
The OPM data included a variety of missing or redacted fields, which complicates data processing and analysis.
missing_value_indicators = ["REDACTED", "INVALID", "NO DATA REPORTED", "NDR", "*", " "]
# Count the occurrences before they are replaced
values_to_nullify = combined_df.isin(missing_value_indicators).sum().sum()
records_affected = combined_df.isin(missing_value_indicators).any(axis=1).sum()
# Replace missing values
combined_df.replace(missing_value_indicators, np.nan, inplace=True)
# Print the results
print(f"Number of values nullified: {values_to_nullify}")
print(f"Number of records affected: {records_affected}")
Number of values nullified: 1310187
Number of records affected: 86580
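The counting-then-replacing pattern is easy to verify on a two-row toy frame (values invented): `isin(...).sum().sum()` counts matching cells, while `isin(...).any(axis=1).sum()` counts rows with at least one match.

```python
import pandas as pd
import numpy as np

indicators = ["REDACTED", "INVALID", "NO DATA REPORTED", "NDR", "*", " "]
toy = pd.DataFrame({"agency": ["REDACTED", "TREASURY"], "grade": ["*", "09"]})

nullified = toy.isin(indicators).sum().sum()       # total cells replaced
affected = toy.isin(indicators).any(axis=1).sum()  # rows touched
clean = toy.replace(indicators, np.nan)

print(nullified, affected)  # 2 1
```

Note that a row with several redacted fields adds several cells to the first count but only one row to the second, which is why 1,310,187 nullified values map to only 86,580 affected records above.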
5. Type Casting: Numeric Values
Standardizing data types for numeric values.
numeric_cols = [
"annualized_adjusted_basic_pay",
"length_of_service_years",
"count",
"supervisory_status_code",
"tenure_code",
"position_occupation_code",
]
for col in numeric_cols:
    if col in combined_df.columns:
        combined_df[col] = pd.to_numeric(combined_df[col], errors="coerce")
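With `errors="coerce"`, any value `pd.to_numeric` cannot parse becomes NaN rather than raising. A minimal illustration with invented salary strings:

```python
import pandas as pd

s = pd.Series(["85000.5", "72,000", None, "61000"])  # "72,000" has a comma to_numeric can't parse
out = pd.to_numeric(s, errors="coerce")

print(out.isna().sum())  # 2 (the comma value and the None)
```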
6. Type Casting: Date Values as Date32
Standardizing data types for date values; personnel action dates are set to the first of their respective months.
date_cols = [
"appointment_not_to_exceed_date",
"service_computation_date_leave"
]
for col in date_cols:
    if col in combined_df.columns:
        combined_df[col] = pd.to_datetime(combined_df[col], errors="coerce").dt.date

# Parse the YYYYMM string into a standard Date object
if "personnel_action_effective_date_yyyymm" in combined_df.columns:
    combined_df["personnel_action_effective_date_yyyymm"] = pd.to_datetime(
        combined_df["personnel_action_effective_date_yyyymm"].astype(str),
        format="%Y%m",
        errors="coerce"
    ).dt.date
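Because the `%Y%m` format carries no day component, pandas pins each parsed date to the first of its month, which is the behavior described above. A quick check with sample values (one deliberately unparseable):

```python
import pandas as pd

s = pd.Series(["202501", "202512", "bogus"])
parsed = pd.to_datetime(s, format="%Y%m", errors="coerce").dt.date

print(parsed.tolist())  # first-of-month dates, NaT for the bad value
```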
7. Remove Wrong SCDs
Some records carried a placeholder service computation date (SCD) of 1900. Convert to pandas datetime to easily check the year, then replace those dates with NaT (Not a Time / null).
mask_1900 = pd.to_datetime(combined_df["service_computation_date_leave"], errors="coerce").dt.year == 1900
combined_df.loc[mask_1900, "service_computation_date_leave"] = pd.NaT
print(f"Nullified {mask_1900.sum()} records with a 1900 service computation date.")
Nullified 253 records with a 1900 service computation date.
8. Fix Length of Service > 100 years
The OPM data includes records with a length of service inaccurately calculated at 125 years or greater. Recalculating by subtracting the service computation date from the personnel action date and flooring the result at 0 years.
mask_100_years = combined_df["length_of_service_years"] > 100
# Calculate the actual difference in years
# We divide by 365.25 to properly account for leap years in the calculation
end_dates = pd.to_datetime(combined_df["personnel_action_effective_date_yyyymm"], errors="coerce")
start_dates = pd.to_datetime(combined_df["service_computation_date_leave"], errors="coerce")
calculated_years = (end_dates - start_dates).dt.days / 365.25
# Force any negative calculations to be 0
calculated_years = calculated_years.clip(lower=0)
# Replace the >100 values with our calculated difference
combined_df.loc[mask_100_years, "length_of_service_years"] = calculated_years[mask_100_years]
# Round the entire column to 1 decimal place to match the dataset's native formatting
combined_df["length_of_service_years"] = combined_df["length_of_service_years"].round(1)
# Extract the newly calculated values to find the min and max
if mask_100_years.sum() > 0:
    new_values = combined_df.loc[mask_100_years, "length_of_service_years"]
    min_val = new_values.min()
    max_val = new_values.max()
    print(f"Recalculated {mask_100_years.sum()} records with > 100 years of service.")
    print(f" --> Smallest new value: {min_val} years")
    print(f" --> Largest new value: {max_val} years")
else:
    print("No records found with > 100 years of service.")
Recalculated 1354 records with > 100 years of service.
--> Smallest new value: 0.0 years
--> Largest new value: 40.9 years
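The divide-by-365.25 and clip-at-zero arithmetic can be checked on two invented date pairs: a clean 20-year span, and a future-dated SCD that would otherwise yield negative service.

```python
import pandas as pd

end = pd.to_datetime(pd.Series(["2025-06-01", "2025-06-01"]))
start = pd.to_datetime(pd.Series(["2005-06-01", "2030-01-01"]))  # second SCD is bad (in the future)

# Same formula as above: day count / 365.25, floored at zero, rounded to 1 decimal
years = ((end - start).dt.days / 365.25).clip(lower=0).round(1)

print(years.tolist())  # [20.0, 0.0]
```

The 2005-2025 span contains exactly five leap days (7,305 days total), so dividing by 365.25 lands on 20.0 exactly, while the negative span is clipped to 0.0.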
9. Type Casting: Booleans
Standardizing boolean fields.
boolean_mapping = {"Y": True, "N": False}
boolean_cols = [
"nsftp_indicator",
"drp_indicator",
"veteran_indicator"
]
for col in boolean_cols:
    if col in combined_df.columns:
        combined_df[col] = combined_df[col].map(boolean_mapping)
        combined_df[col] = combined_df[col].astype("boolean")
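The two-step map-then-cast matters because `.map()` turns anything outside the Y/N mapping into NaN, and pandas' nullable `"boolean"` dtype (unlike plain `bool`) preserves those missing values as `pd.NA`:

```python
import pandas as pd

s = pd.Series(["Y", "N", None, "Y"])
out = s.map({"Y": True, "N": False}).astype("boolean")

print(out.dtype)         # boolean (nullable)
print(out.isna().sum())  # 1
```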
10. Type Casting: Categoricals as Strings
The OPM dataset includes numerous duplicative fields for codes and their descriptors. We cast all codes as strings for future use.
# Find all code columns, plus any others that should strictly be text
code_cols = [col for col in combined_df.columns if col.endswith('_code')]
code_cols.extend(['grade']) # Add grade to ensure '04', '09' don't become 4.0, 9.0
for col in code_cols:
    if col in combined_df.columns:
        combined_df[col] = combined_df[col].astype(str).str.replace(r'\.0$', '', regex=True).replace('nan', np.nan)

print(f"Sanitized {len(code_cols)} identifier columns to clean strings.")
Sanitized 27 identifier columns to clean strings.
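The chained cast handles three cases at once: floats that picked up a trailing `.0`, true missing values (which `astype(str)` stringifies as `"nan"`), and zero-padded codes that must stay text. A toy column mixing all three (values invented):

```python
import pandas as pd
import numpy as np

s = pd.Series([4.0, 9.0, np.nan, "0301"])
clean = s.astype(str).str.replace(r"\.0$", "", regex=True).replace("nan", np.nan)

print(clean.tolist())  # ['4', '9', nan, '0301']
```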
11. Drop Useless Columns
The column 'count' consisted entirely of the value '1' for every row in the data. Ain't nobody got time for that.
columns_to_drop = ["count"]
existing_cols_to_drop = [col for col in columns_to_drop if col in combined_df.columns]
if existing_cols_to_drop:
    combined_df.drop(columns=existing_cols_to_drop, inplace=True)
    print(f"Dropped columns: {existing_cols_to_drop}")
Dropped columns: ['count']
12. Create JSON Lookup File and Drop Descriptions
To reduce the download file size in Observable and start displaying data faster, this section creates a JSON lookup file with unique coded values and drops duplicative fields before generating the parquet file. The JSON file will be used to recreate data labels in Observable.
import json
# Define pairs of (Code Column, Description Column)
lookup_pairs = [
("accession_category_code", "accession_category"),
("agency_code", "agency"),
("agency_subelement_code", "agency_subelement"),
("appointment_type_code", "appointment_type"),
("bargaining_unit_code", "bargaining_unit"),
("consolidated_statistical_area_code", "consolidated_statistical_area"),
("core_based_statistical_area_code", "core_based_statistical_area"),
("duty_station_country_code", "duty_station_country"),
("duty_station_county_code", "duty_station_county"),
("duty_station_state_code", "duty_station_state"),
("education_level_code", "education_level"),
("flsa_category_code", "flsa_category"),
("locality_pay_area_code", "locality_pay_area"),
("occupational_category_code", "occupational_category"),
("occupational_group_code", "occupational_group"),
("occupational_series_code", "occupational_series"),
("pay_basis_code", "pay_basis"),
("pay_plan_code", "pay_plan"),
("position_occupied_code", "position_occupied"),
("step_or_rate_type_code", "step_or_rate_type"),
("supervisory_status_code", "supervisory_status"),
("tenure_code", "tenure"),
("work_schedule_code", "work_schedule")
]
lookup_dict = {}
for code_col, desc_col in lookup_pairs:
    # Ensure both columns exist to avoid KeyErrors
    if code_col in combined_df.columns and desc_col in combined_df.columns:
        # Keep only rows where both the code and description exist
        valid_pairs = combined_df[[code_col, desc_col]].dropna()
        # Drop duplicates and convert to a dictionary mapping: { "Code": "Description" }
        mapping = valid_pairs.drop_duplicates(subset=[code_col]).set_index(code_col)[desc_col].to_dict()
        # Store in the master dictionary using the description column name as the category key
        lookup_dict[desc_col] = mapping
# 1. Export the lookup dictionary to a JSON file for Observable
with open("xwalk.json", "w") as f:
json.dump(lookup_dict, f, indent=2)
print("Created xwalk.json!")
# 2. Drop description columns from the main DataFrame
cols_to_drop = [desc_col for _, desc_col in lookup_pairs if desc_col in combined_df.columns]
combined_df.drop(columns=cols_to_drop, inplace=True)
print(f"Dropped {len(cols_to_drop)} redundant description columns to shrink Parquet size.")
Created xwalk.json!
Dropped 23 redundant description columns to shrink Parquet size.
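A round-trip sketch with invented codes (`TR`, `AG` are not real agency codes) shows what one entry of the crosswalk looks like after the dedupe-and-index step, and that the mapping survives the trip through JSON intact:

```python
import json
import pandas as pd

toy = pd.DataFrame({
    "agency_code": ["TR", "TR", "AG"],
    "agency": ["TREASURY", "TREASURY", "AGRICULTURE"],
})

# Same dedupe-and-index pattern as the loop above
mapping = (
    toy[["agency_code", "agency"]]
    .dropna()
    .drop_duplicates(subset=["agency_code"])
    .set_index("agency_code")["agency"]
    .to_dict()
)
lookup = {"agency": mapping}

# Round-trip through JSON, as Observable would load xwalk.json
restored = json.loads(json.dumps(lookup))
print(restored["agency"]["TR"])  # TREASURY
```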
13. Export to Parquet
Do the thing!
combined_df.to_parquet(output_file, engine="pyarrow", index=False)
print(f"Data successfully cleaned, typed, and exported to {output_file}")
Data successfully cleaned, typed, and exported to opm_accessions.parquet