The Before Times: December 2024
Explore the federal workforce baseline heading into 2025. Use the filters below to drill down into specific agencies, components, occupations, and demographics to compare against current workforce separations and trends.
Additional charts may be added over time.
Total Workforce Baseline
Average Service (Years)
Veterans
Supervisors & Leadership
STEM Professionals
Bargaining Unit Eligible
Before Times Employee Sample Records
The Before Times: Data Dictionary & Processing
This page focuses on record-level data from OPM's Enterprise Human Resources Integration (EHRI) Status dataset from December 31, 2024. According to OPM, EHRI Status is a snapshot of the federal workforce on the last day of each month. Status data files include all active employees (in either a pay or non-pay status) for each agency's workforce as of the last day of each month.
Click here to review the data dictionary for this dataset.
The Before Times: Data Dictionary
A limited subset of the dataset was retained to allow baseline comparisons with 2025 separation data. Explore the retained data types below.
Click here to see how this data was pre-processed in Jupyter Notebook
The Before Times: Data Processing
This notebook processes raw, record-level data from OPM's Enterprise Human Resources Integration (EHRI) Status dataset. According to OPM, EHRI Status is a snapshot of the federal workforce on the last day of each month. Status data files include all active employees (in either a pay or non-pay status) for each agency's workforce as of the last day of each month. This notebook focuses on data from December 31, 2024.
1. Configure I/O
import pandas as pd
import numpy as np
import glob
txt_files = glob.glob("employment_*.txt")
output_file = "before_times.parquet"
print(f"Found {len(txt_files)} files to process:")
for f in txt_files:
    print(f" - {f}")
Found 1 files to process:
- employment_202412_2_2026-03-30.txt
2. Load Data in Pandas
The OPM data comes in a single, pipe-delimited text file nearing 2GB in size. Here we're checking for multiple files in the event we want to expand this project's scope in the future.
NOTE: Monthly datasets will contain millions of duplicate records, so processing multiple files at once may be counterproductive. For now, we just need one.
dataframes = []
for file in txt_files:
    try:
        # Read the pipe-delimited file into pandas
        df = pd.read_csv(file,
                         sep='|',             # Separate by pipes
                         engine='python',     # Use the python engine, which is slower but more flexible
                         on_bad_lines='skip'  # Skip lines with an unexpected number of fields
                         )
        # Add the successfully loaded dataframe to the list
        dataframes.append(df)
        print(f"Successfully loaded file with {len(df)} rows")
    except Exception as e:
        print(f"Error: {e}")
        # On error, inspect the file that failed to load
        print("\nShowing first few lines of the raw file:")
        with open(file, 'r') as f:
            for i, line in enumerate(f):
                if i < 10:  # Print the first 10 lines
                    print(f"Line {i+1}: {line.strip()}")
                else:
                    break
# Check if we have any dataframes to concatenate
if dataframes:
    cdf = pd.concat(dataframes, ignore_index=True)
    print(f"\nTotal records loaded: {len(cdf)}")
else:
    print("No dataframes were successfully loaded.")
Successfully loaded file with 2312301 rows
Total records loaded: 2312301
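As the note above warns, combining multiple monthly snapshot files would introduce duplicate records for employees present in more than one month. Should that scope expansion ever happen, duplicates could be dropped after concatenation. A minimal sketch with synthetic data (column names are illustrative, not from the EHRI schema):

```python
import pandas as pd

# Two hypothetical monthly extracts that share one identical record
jan = pd.DataFrame({"employee_id": [1, 2], "agency": ["GSA", "OPM"]})
feb = pd.DataFrame({"employee_id": [2, 3], "agency": ["OPM", "DOI"]})

combined = pd.concat([jan, feb], ignore_index=True)
deduped = combined.drop_duplicates()  # keep one copy of each identical row
print(f"{len(combined)} rows before dedup, {len(deduped)} after")
```

Note that `drop_duplicates()` only removes rows that are identical across every column; records that changed between snapshots would need a different strategy.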
3. Clean Strings
The OPM data includes a variety of missing, irrelevant, or redacted values. We convert them to null for ease of processing.
missing_value_indicators = ["REDACTED", "INVALID", "NO DATA REPORTED", "*", " "]
# Count the occurrences before they are replaced
values_to_nullify = cdf.isin(missing_value_indicators).sum().sum()
records_affected = cdf.isin(missing_value_indicators).any(axis=1).sum()
# Replace missing values
cdf.replace(missing_value_indicators, np.nan, inplace=True)
# Print the results
print(f"Number of values nullified: {values_to_nullify}")
print(f"Number of records affected: {records_affected}")
Number of values nullified: 20447679
Number of records affected: 1311808
4. Type Casting: Numeric Values
Standardizing numeric fields.
numeric_cols = [
    "length_of_service_years"
]
for col in numeric_cols:
    if col in cdf.columns:
        cdf[col] = pd.to_numeric(cdf[col], errors="coerce")
5. Type Casting: Dates as Date32
Standardizing date values. This field is only retained long enough to validate length of service.
date_cols = [
    "service_computation_date_leave"
]
for col in date_cols:
    if col in cdf.columns:
        cdf[col] = pd.to_datetime(cdf[col], errors="coerce").dt.date
6. Remove Wrong SCDs
Convert the field to pandas datetime, check whether the year is 1900, and replace matching values with NaT (Not a Time, i.e., null).
mask_1900 = pd.to_datetime(cdf["service_computation_date_leave"], errors="coerce").dt.year == 1900
cdf.loc[mask_1900, "service_computation_date_leave"] = pd.NaT
print(f"Nullified {mask_1900.sum()} records with a 1900 service computation date.")
Nullified 18 records with a 1900 service computation date.
7. Fix Length of Service > 100 years
Previous OPM datasets included records with a length of service inaccurately calculated at greater than 100 years. We check for erroneous data and correct it by subtracting the service computation date from December 31, 2024, with a floor of 0 years.
NOTE: end_dates is currently hard-coded and may need to be updated if this notebook is used with other datasets.
# Print the min and max values before any conversions
print("Raw Service Range:")
print(f" --> Minimum length of service: {cdf['length_of_service_years'].min()} years")
print(f" --> Maximum length of service: {cdf['length_of_service_years'].max()} years")
print(f" --> Number of records with > 100 years: {(cdf['length_of_service_years'] > 100).sum()}")
print("-" * 50)
mask_100_years = cdf["length_of_service_years"] > 100
# Calculate the actual difference in years
# We divide by 365.25 to properly account for leap years in the calculation
end_dates = pd.to_datetime('2024-12-31') # Using December 31, 2024 as static end date
start_dates = pd.to_datetime(cdf["service_computation_date_leave"], errors="coerce")
calculated_years = (end_dates - start_dates).dt.days / 365.25
# Force any negative calculations to be 0
calculated_years = calculated_years.clip(lower=0)
# Replace the >100 values with our calculated difference
cdf.loc[mask_100_years, "length_of_service_years"] = calculated_years[mask_100_years]
# Round the entire column to 1 decimal place to match the dataset's native formatting
cdf["length_of_service_years"] = cdf["length_of_service_years"].round(1)
# Extract the newly calculated values to find the min and max
if mask_100_years.sum() > 0:
    new_values = cdf.loc[mask_100_years, "length_of_service_years"]
    min_val = new_values.min()
    max_val = new_values.max()
    print(f"Recalculated {mask_100_years.sum()} records with > 100 years of service.")
    print(f" --> Smallest new value: {min_val} years")
    print(f" --> Largest new value: {max_val} years")
else:
    print("No records found with > 100 years of service.")
Raw Service Range:
--> Minimum length of service: 0.0 years
--> Maximum length of service: 80.3 years
--> Number of records with > 100 years: 0
--------------------------------------------------
No records found with > 100 years of service.
8. Type Casting: Booleans
Standardizing boolean fields.
boolean_mapping = {"Y": True, "N": False}
boolean_cols = [
    "veteran_indicator"
]
for col in boolean_cols:
    if col in cdf.columns:
        cdf[col] = cdf[col].map(boolean_mapping)
        cdf[col] = cdf[col].astype("boolean")
9. Type Casting: Categoricals
Optimizing the remaining object columns by converting any with fewer than 100 unique values to the categorical dtype, which stores each distinct string only once.
for col in cdf.select_dtypes(include=["object"]).columns:
    if cdf[col].nunique() < 100:
        cdf[col] = cdf[col].astype("category")
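To confirm the categorical conversion is worth the trouble, memory usage can be compared before and after. A minimal sketch on synthetic data (the column content is illustrative, not from the dataset):

```python
import pandas as pd

# A low-cardinality string column, similar to the retained status fields
s = pd.Series(["VETERAN", "NON-VETERAN"] * 500_000)

before = s.memory_usage(deep=True)
after = s.astype("category").memory_usage(deep=True)
print(f"object: {before:,} bytes -> category: {after:,} bytes")
```

Because a categorical stores each unique string once plus a small integer code per row, the savings grow with row count while cardinality stays low.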
10. Space Saver: Dropping Columns
This dataset is being used as a baseline reference, which only requires a small subset of the original data.
Current remaining fields for comparison include:
- age_bracket
- agency
- agency_subelement
- bargaining_unit_status
- length_of_service_years
- occupational_series_code
- position_occupied
- stem_occupation_type
- supervisory_status
- veteran_indicator
Additional fields may be added or removed based on future needs.
columns_to_drop = [
    "agency_code",
    "agency_subelement_code",
    "annualized_adjusted_basic_pay",
    "appointment_type",
    "appointment_type_code",
    "bargaining_unit",
    "bargaining_unit_code",
    "cfo_act_agency_indicator",
    "consolidated_statistical_area",
    "consolidated_statistical_area_code",
    "core_based_statistical_area",
    "core_based_statistical_area_code",
    "count",
    "duty_station_code",
    "duty_station_country",
    "duty_station_country_code",
    "duty_station_county",
    "duty_station_county_code",
    "duty_station_state",
    "duty_station_state_abbreviation",
    "duty_station_state_code",
    "duty_station_state_country_territory_code",
    "education_level",
    "education_level_bracket",
    "education_level_code",
    "flsa_category",
    "flsa_category_code",
    "grade",
    "locality_pay_area",
    "locality_pay_area_code",
    "nsftp_indicator",
    "occupational_category",
    "occupational_category_code",
    "occupational_group",
    "occupational_group_code",
    "occupational_series",
    "pay_basis",
    "pay_basis_code",
    "pay_plan",
    "pay_plan_code",
    "personnel_office_identifier_code",
    "position_occupied_code",
    "service_computation_date_leave",
    "snapshot_yyyymm",
    "stem_occupation",
    "step_or_rate_type",
    "step_or_rate_type_code",
    "supervisory_status_code",
    "tenure",
    "tenure_code",
    "work_schedule",
    "work_schedule_code"
]
existing_cols_to_drop = [col for col in columns_to_drop if col in cdf.columns]
if existing_cols_to_drop:
    cdf.drop(columns=existing_cols_to_drop, inplace=True)
    print(f"Dropped {len(existing_cols_to_drop)} columns.")
Dropped 52 columns.
11. Export to Parquet
Go forth, and do great things!
cdf.to_parquet(output_file, engine="pyarrow", index=False)
print(f"Data successfully cleaned, typed, and exported to {output_file}")
Data successfully cleaned, typed, and exported to before_times.parquet