The Before Times: December 2024
Explore the federal workforce baseline heading into 2025. Use the filters below to drill down into specific agencies, components, occupations, and demographics to compare against current workforce separations and trends.
Additional charts may be added over time.
Total Workforce Baseline
Average Service (Years)
Veterans
Supervisors & Leadership
STEM Professionals
Bargaining Unit Eligible
Before Times Employee Sample Records
The Before Times: Data Dictionary & Processing
This page focuses on record-level data from OPM's Enterprise Human Resources Integration (EHRI) Status dataset from December 31, 2024. According to OPM, EHRI Status is a snapshot of the federal workforce on the last day of each month. Status data files include all active employees (in either a pay or non-pay status) for each agency's workforce as of the last day of each month.
Click here to review the data dictionary for this dataset.
The Before Times: Data Dictionary
A limited subset of the dataset was retained to allow baseline comparisons with 2025 separation data. Explore the retained data types below.
Click here to see how this data was pre-processed in Jupyter Notebook
The Before Times: Data Processing
This notebook processes raw, record-level data from OPM's Enterprise Human Resources Integration (EHRI) Status dataset. According to OPM, EHRI Status is a snapshot of the federal workforce on the last day of each month. Status data files include all active employees (in either a pay or non-pay status) for each agency's workforce as of the last day of each month. This notebook focuses on data from December 31, 2024.
1. Configure I/O
import pandas as pd
import numpy as np
import glob
txt_files = glob.glob("employment_*.txt")
output_file = "before_times.parquet"
print(f"Found {len(txt_files)} files to process:")
for f in txt_files:
    print(f" - {f}")
Found 1 files to process:
- employment_202412_2_2026-03-30.txt
2. Load Data in Pandas
The OPM data comes in a single, pipe-delimited text file nearing 2GB in size. Here we're checking for multiple files in the event we want to expand this project's scope in the future.
NOTE: Monthly datasets will contain millions of duplicate records, so processing multiple files at once may be counterproductive. For now, we just need one.
dataframes = []
for file in txt_files:
    try:
        # Read the pipe-delimited file into pandas
        df = pd.read_csv(file,
                         sep='|',             # Separate by pipes
                         engine='python',     # Use the python engine, which is slower but more flexible
                         on_bad_lines='skip'  # Skip lines with an unexpected number of fields
                         )
        # Add the successfully loaded dataframe to the list
        dataframes.append(df)
        print(f"Successfully loaded file with {len(df)} rows")
    except Exception as e:
        print(f"Error: {e}")
        # On error, inspect the file that failed to load
        print("\nShowing first few lines of the raw file:")
        with open(file, 'r') as f:
            for i, line in enumerate(f):
                if i < 10:  # Print the first 10 lines
                    print(f"Line {i+1}: {line.strip()}")
                else:
                    break
# Check if we have any dataframes to concatenate
if dataframes:
    cdf = pd.concat(dataframes, ignore_index=True)
    print(f"\nTotal records loaded: {len(cdf)}")
else:
    print("No dataframes were successfully loaded.")
Successfully loaded file with 2312301 rows
Total records loaded: 2312301
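As the note above warns, combining multiple monthly snapshot files would introduce duplicate records for employees present in more than one month. Should that scope expansion ever happen, duplicates could be dropped after concatenation. A minimal sketch with synthetic data (column names are illustrative, not from the EHRI schema):

```python
import pandas as pd

# Two hypothetical monthly extracts that share one identical record
jan = pd.DataFrame({"employee_id": [1, 2], "agency": ["GSA", "OPM"]})
feb = pd.DataFrame({"employee_id": [2, 3], "agency": ["OPM", "DOI"]})

combined = pd.concat([jan, feb], ignore_index=True)
deduped = combined.drop_duplicates()  # keep one copy of each identical row
print(f"{len(combined)} rows before dedup, {len(deduped)} after")
```

Note that `drop_duplicates()` only removes rows that are identical across every column; records that changed between snapshots would need a different strategy.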
3. Clean Strings
The OPM data includes a variety of missing, irrelevant, or redacted values. We convert them to null for ease of processing.
missing_value_indicators = ["REDACTED", "INVALID", "NO DATA REPORTED", "*", " "]
# Count the occurrences before they are replaced
values_to_nullify = cdf.isin(missing_value_indicators).sum().sum()
records_affected = cdf.isin(missing_value_indicators).any(axis=1).sum()
# Replace missing values
cdf.replace(missing_value_indicators, np.nan, inplace=True)
# Print the results
print(f"Number of values nullified: {values_to_nullify}")
print(f"Number of records affected: {records_affected}")
Number of values nullified: 20447679
Number of records affected: 1311808
4. Type Casting: Numeric Values
Standardizing numeric fields.
numeric_cols = [
    "length_of_service_years"
]
for col in numeric_cols:
    if col in cdf.columns:
        cdf[col] = pd.to_numeric(cdf[col], errors="coerce")
5. Type Casting: Dates as Date32
Standardizing date values. This field is only retained long enough to validate length of service.
date_cols = [
    "service_computation_date_leave"
]
for col in date_cols:
    if col in cdf.columns:
        cdf[col] = pd.to_datetime(cdf[col], errors="coerce").dt.date
6. Remove Wrong SCDs
Convert the field to pandas datetime, check whether the year is 1900, and replace matching values with NaT (Not a Time, i.e., null).
mask_1900 = pd.to_datetime(cdf["service_computation_date_leave"], errors="coerce").dt.year == 1900
cdf.loc[mask_1900, "service_computation_date_leave"] = pd.NaT
print(f"Nullified {mask_1900.sum()} records with a 1900 service computation date.")
Nullified 18 records with a 1900 service computation date.
7. Fix Length of Service > 100 years
Previous OPM datasets included records with a length of service inaccurately calculated at greater than 100 years. We check for erroneous data and correct it by subtracting the service computation date from December 31, 2024, with a floor of 0 years.
NOTE: end_dates is currently hard-coded and may need to be updated if this notebook is used with other datasets.
# Print the min and max values before any conversions
print("Raw Service Range:")
print(f" --> Minimum length of service: {cdf['length_of_service_years'].min()} years")
print(f" --> Maximum length of service: {cdf['length_of_service_years'].max()} years")
print(f" --> Number of records with > 100 years: {(cdf['length_of_service_years'] > 100).sum()}")
print("-" * 50)
mask_100_years = cdf["length_of_service_years"] > 100
# Calculate the actual difference in years
# We divide by 365.25 to properly account for leap years in the calculation
end_dates = pd.to_datetime('2024-12-31') # Using December 31, 2024 as static end date
start_dates = pd.to_datetime(cdf["service_computation_date_leave"], errors="coerce")
calculated_years = (end_dates - start_dates).dt.days / 365.25
# Force any negative calculations to be 0
calculated_years = calculated_years.clip(lower=0)
# Replace the >100 values with our calculated difference
cdf.loc[mask_100_years, "length_of_service_years"] = calculated_years[mask_100_years]
# Round the entire column to 1 decimal place to match the dataset's native formatting
cdf["length_of_service_years"] = cdf["length_of_service_years"].round(1)
# Extract the newly calculated values to find the min and max
if mask_100_years.sum() > 0:
    new_values = cdf.loc[mask_100_years, "length_of_service_years"]
    min_val = new_values.min()
    max_val = new_values.max()
    print(f"Recalculated {mask_100_years.sum()} records with > 100 years of service.")
    print(f" --> Smallest new value: {min_val} years")
    print(f" --> Largest new value: {max_val} years")
else:
    print("No records found with > 100 years of service.")
Raw Service Range:
--> Minimum length of service: 0.0 years
--> Maximum length of service: 80.3 years
--> Number of records with > 100 years: 0
--------------------------------------------------
No records found with > 100 years of service.
8. Type Casting: Booleans
Standardizing boolean fields.
boolean_mapping = {"Y": True, "N": False}
boolean_cols = [
    "veteran_indicator"
]
for col in boolean_cols:
    if col in cdf.columns:
        cdf[col] = cdf[col].map(boolean_mapping)
        cdf[col] = cdf[col].astype("boolean")
9. Type Casting: Categoricals
Optimizing the remaining object columns by converting any with fewer than 100 unique values to the categorical dtype, which stores each distinct string only once.
for col in cdf.select_dtypes(include=["object"]).columns:
    if cdf[col].nunique() < 100:
        cdf[col] = cdf[col].astype("category")
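To confirm the categorical conversion is worth the trouble, memory usage can be compared before and after. A minimal sketch on synthetic data (the column content is illustrative, not from the dataset):

```python
import pandas as pd

# A low-cardinality string column, similar to the retained status fields
s = pd.Series(["VETERAN", "NON-VETERAN"] * 500_000)

before = s.memory_usage(deep=True)
after = s.astype("category").memory_usage(deep=True)
print(f"object: {before:,} bytes -> category: {after:,} bytes")
```

Because a categorical stores each unique string once plus a small integer code per row, the savings grow with row count while cardinality stays low.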
10. Space Saver: Dropping Columns
This dataset is being used as a baseline reference, which only requires a small subset of the original data.
Current remaining fields for comparison include:
- age_bracket
- agency
- agency_subelement
- bargaining_unit_status
- length_of_service_years
- occupational_series_code
- position_occupied
- stem_occupation_type
- supervisory_status
- veteran_indicator
Additional fields may be added or removed based on future needs.
columns_to_drop = [
    "agency_code",
    "agency_subelement_code",
    "annualized_adjusted_basic_pay",
    "appointment_type",
    "appointment_type_code",
    "bargaining_unit",
    "bargaining_unit_code",
    "cfo_act_agency_indicator",
    "consolidated_statistical_area",
    "consolidated_statistical_area_code",
    "core_based_statistical_area",
    "core_based_statistical_area_code",
    "count",
    "duty_station_code",
    "duty_station_country",
    "duty_station_country_code",
    "duty_station_county",
    "duty_station_county_code",
    "duty_station_state",
    "duty_station_state_abbreviation",
    "duty_station_state_code",
    "duty_station_state_country_territory_code",
    "education_level",
    "education_level_bracket",
    "education_level_code",
    "flsa_category",
    "flsa_category_code",
    "grade",
    "locality_pay_area",
    "locality_pay_area_code",
    "nsftp_indicator",
    "occupational_category",
    "occupational_category_code",
    "occupational_group",
    "occupational_group_code",
    "occupational_series",
    "pay_basis",
    "pay_basis_code",
    "pay_plan",
    "pay_plan_code",
    "personnel_office_identifier_code",
    "position_occupied_code",
    "service_computation_date_leave",
    "snapshot_yyyymm",
    "stem_occupation",
    "step_or_rate_type",
    "step_or_rate_type_code",
    "supervisory_status_code",
    "tenure",
    "tenure_code",
    "work_schedule",
    "work_schedule_code"
]
existing_cols_to_drop = [col for col in columns_to_drop if col in cdf.columns]
if existing_cols_to_drop:
    cdf.drop(columns=existing_cols_to_drop, inplace=True)
    print(f"Dropped {len(existing_cols_to_drop)} columns.")
Dropped 52 columns.
11. Export to Parquet
Go forth, and do great things!
cdf.to_parquet(output_file, engine="pyarrow", index=False)
print(f"Data successfully cleaned, typed, and exported to {output_file}")
Data successfully cleaned, typed, and exported to before_times.parquet