LIONS data import and program documentation
Prepared by Sarah Cohen, July 2022
What is LIONS data?
This data goes by various names, usually called “LIONS” or the US Attorney Case Processing statistics. It stands for “Legal Information Office Network System”. More recent uses of the data call it “CaseView”, which is a new-ish version of the front end used by US Attorney offices. There was little change in the underlying database.
It was originally released after a long court fight with David Burnham and Susan Long, who started the Transactional Records Access Clearinghouse Syracuse University. Unfortunately, that group has been unwilling to really work with reporters over a long period of time, so we’ve been stuck with what they felt like doing – and not that confident they know what they’re doing.
Other reporters have taken a stab at working on this, but it really requires some concentrated time and, most importantly, sources who can help us understand what we’re seeing.
General description
The big differences between this database and a PACER database of federal court records are:
This contains every referral to a US Attorney, including those that are not charged or settled publicly. It means you can analyze what cases are being chosen and what priorities are being emphasized and de-emphasized.
It has no names or identifying information – although it is case level data, you will have to report out any cases you care about. For those that had court actions, it’s not too hard: the date and district is listed, and searching PACER for cases filed on a specific day in a specific court is not very hard. Without good human sources, it will be extremely difficult to get the names and details for declinations – by definition, these are sensitive law enforcement records on cases that were never brought.
The data comes as a highly relational database of about 20-30 tables. (There are actually 400+ tables, many of them are district-level breakouts of one table when it would be too large a file to put them together.) A good deal of the database is more detailed than we need. For example, there are “events”, which include getting the case, declining it, bringing it to court, etc. Then there are “participants” and “participant events” that deal with individual people within cases.
It’s almost impossible to match data produced elsewhere from this data. (See the specific readme on a comparison of sources.) Because of errors in the database, redactions, and timing, the data just doesn’t match. Believe me. I’ve spent too many weeks of my life trying to do this. I have not found a good source at Justice to walk me through how their reports are produced at a very detailed level.
Source notes
There are 28 .zip files included in the download from the US Department of Justice’s website at https://www.justice.gov/usao/resources/foia-library/national-caseload-data as of May 2022 . The exact configuration of which file is in which zip file changes almost monthly. They’re moved around as they attempt to keep each file below about 3 gb. When they run out of ways to move them around, they add a new file. Each month is cumulative.
One of the best detailed descriptions that I’ve seen of some of the details of the database is in the 2018 Annual Report to Congress on Indian Country Investigations and Prosecutions. Beginning around Page 20, there are descriptions of how LIONS processes the cases, the difference between “matters” and “cases”, and how program codes and declinations are assigned and captured. Other details on how the data is entered and determined is in the “LIONS User Manual” that is distributed with the data, last updated in 2016. Other documents reference the CaseView User Manual, but our understanding is that it is very similar, except in how it looks.
Here are a few details from the data entry manual on different tables that describe important elements:
5.4.2: Court History – Creates a history of all court filings for each case. A new Court History record is opened to track the court and the court’s location, as well as the judges associated with the case for each stage (indictment, superseding indictment, post disposition, appeal) of the case. Prior to any filing, a “not in court” record is opened. (NOTE: Most of what we do filters for three courts: the Not in court entry, District court and Magistrate court. That removes a lot of the difficulty with post-dispositions, weird courts like D.C. Superior, search warrants, and appeals.)
Events - Tracks the significant steps in the progression of the case.Events are created by your district and may be as specific or as general as the district deems necessary. (NOTE: I’ve ignored these - they’re useful if you’re going to look in detail at any given case, but by then I would hope we’re looking at the court records anyway. There’s hardly anything in these for declined cases.)
Completeness and timing
This is dated, but in 1998 the BJS did an analysis of the completeness of this data. Here’s what it said:
“US attorneys initiate approximately 95 percent of the criminal cases prosecuted by the department. . Litigating divisions of the Department handle the remaining cases. While many of the cases initiated by the litigating divisions are included in the U.S. attorneys’ database, each division maintains its own separate data collection system to track cases processed.”
I don’t know this for sure, but I suspect that cases that a) could move between civil and criminal relatively simply, such as regulatory offenses or charges against companies, and b) anything that has its own division or task force in the litigating divisions ( anti-trust, civil rights, hate crimes, COVID fraud) are represented less well in the data than more routine cases ( immigration, drugs, or firearms).
Relational structure
One key insight that it took me a long time to understand is that the same data points are in several different places - it appears to me that they create convenience tables so they don’t have to do as much joining to either populate data entry forms or to create reports. But they don’t document how that is done, so we don’t know which of many possible values they put in some of them.
For example, dispositions are in the court history
and the participant court history
tables. According to the documentation, everything is entered at the defendant level – even when they enter it at the case level, it populates the defendant table. But they never say how they choose WHICH defendant to use in the disposition when there is more than one. There are many possible answers to this: The last one to be finished, which signals the whole case is finished; the first one because it may be something like a lead defendant, or even alphabetically (which we don’t have).
The same thing happens with charges, instruments and counts. The charges in the count data are the same as the charges in the instruments data, with more detail. It’s also unclear how they determine the disposition of a defendant when there are many counts and charges.
As a result, it’s really hard to reproduce tables created from this dataset, since a) there aren’t really any edit checks to make sure impossible things don’t happen (it’s not got a lot of internal integrity), and b) you don’t know WHICH table they are pulling things from and WHEN.
Case categories and charges
I have really only looked at the criminal side of this data, which is indicated by an “R” type of case (as opposed to a “V” type).
Lead charges / counts
Charges are found in several locations in the database and the translations are incomplete.
They’re also really difficult to work with because each crime might be identified at a different level of detail. For example, Title 18 Section 922 is “Unlawful acts” related to firearms. 18 USC 922g is possession by a felon or other prohibited person. 18 USC 922g1 is possession by a convicted felon. (Firearms is a special case, where they have to enter the detailed statute.) Many charges have no more specifics than the section or part. TRAC and other sources tend to produce reports only at the section level , not the specific charge.
In any of the relevant tables – the lead charge, the instrument charges and counts (which refer to specific defendants). I then used regular expressions to assign them as US Code, Federal Regulation, State or territorial , or other “Special” code, and created a section_code and section_desc column to associate the charge with its higher level section. This in the file data/clean/lookups/lkp_charge_master.Rda
.
At some point, if we wanted to do real analysis of them, it would be useful to break get the chapters associated with each section, but that’s actually a little harder than it sounds. I wasn’t able to find a simple lookup hierarchy in easily digestible form.
Lead charge
According to the user manual, the lead charge is assigned at the point that they create a case file, and should be “the substantive statute that is the primary reason for referring” the case from the investigating agency. However, there are three caveats:
The lead charge should be “substantive”. I’m not sure I understand what that means, but the user manual specifically says that it should NOT refer to a penalty, and it should NOT be “aiding and abetting”. That said, there are hundreds of thousands of cases in which the lead charge refers to a penalty, aiding or abetting, or a special jurisdictional statute.
The lead charge should be reviewed when charges are filed, and modified as needed. However, Susan Long, a founder of Syracuse’s TRAC program, said they are never changed. I’m not sure that’s true. We could potentially check by looking at old versions of the database when we know it’s very different, but that would be a big project. From what I see in the user manual, there is no prompt when charges are filed to review this code, so it’s unclear how often in practice it happens.
It may or may not be either the first count, or it may or may not be the charge with the largest penalty (though that is often the case). Reviews of the court processing statistics by BJS and the Justice Department’s OIG say that these can be kind of random – they might choose a drug charge, or they might choose a weapons charge if they’re both relevant, and there is no guidance to tell them which one is more important.
How they enter charges
The instructions say that they should:
Enter the two-digit Title with a leading zero if necessary
Leave a blank space if it’s from the U.S. code (federal law), or “D” for D.C. code, “R” for the Code of Federal Regulations , “S” for a state code, and “T” for Virgin Islands code.
Followed by a colon (:) and the five-digit section number if it’s the US code, with leading zeros. Then enter the subsection and numbers.
However, there are thousands of charges in the database that don’t follow that pattern. Although it seems that there are a lot, they refer to very few cases. Many of the codes with missing translations are state code references. From what I can tell, there is a weak edit check for this. (The US code and regulation fields are lookups, so they can pick them out of a list but it appears they can override if they want)
Charge severity
Unlike courts and the US Sentencing Commission, there is no code in this database that tells you how bad a charge is – there’s no concept of a severity, other than felony, misdemeanor or petty offense. There is a reference to the statute that defines the penalties for a given charge, but nothing in the data that tells you, for example, what the maximum or minimum sentence is for a charge. This makes it really hard to make a list of the “top” charge – you kind of have to work with what they have as the count number, and hope that this is what was done in the court papers.
There is a sentencing table for convictions, but I’ve never spent much time with it. I think there are better avenues for finding information on sentences than this database.
Program category codes
The program codes are the broad categories of criminal cases. (Cause of action is the equivalent in civil) . Detailed descriptions and groupings of the program category codes are shown beginning on page A-69 of the appendix a document distributed with the data.
There are important instructions here. For example, they added a code for transnational organized crime in 2014, but say that they should first use the primary substantive offense (drug trafficking) then add that new code as a second or third one. The national priorities are identified in this list.
According to the user manual, the AUSA assigns the “substantive” program code first, but then also assigns a secondary one where the federal governmant has “special jurisdiction, e.g. assimilated crimes or offenses in Indian country. This means that these cases will often have a first code, eg,”Firearms” and a secondary code of 092 (Violent crime in indian country), 065 (Indian offenses, non-violent crime) or 084 (assimilated crimes). They can also choose among various codes, and there is no guidance about which one is more important. Drug and gun cases are often interchangeable; civil rights and hate crimes are as well. White collar crime has very specific program categories, but it seems like few people use them, and the lion’s share of white collar crime is under “postal fraud”, which also refers to internet fraud; and “other white collar crime”.
It’s worth remembering that there is very little violent crime that is the purview of the US (rather than state or local) government , so almost all of it except for bank robbery and firearms are special jurisdiction.
Dispositions
Dispositions of criminal matters / cases
The master case table has no outcome of the case. The outcomes are shown in a variety of tables at differing levels of detail. They are:
- Court histories: This is the table created from the data entry shown above. It is the only place to find the court (or NC, for not in court). It includes filing dates, which match the dates of in the
instruments
table. It includes a disposition for at the case level, but it’s unclear how they determine this if it’s got more than one defendant. (Not in court entries contain the outcome of matters never sent to court.)
- Participant court histories: These include histories for each participant, not just each defendant. Usually, we filter this for defendants. The redactions are such that any
role
that begins with “D” is a defendant. (It’sD*
for redacted cases, and then a more specific type of defendant in a very few set of cases, such as officials in public corruption cases. This does not have the filing dates – just the disposition dates and dispositions. It’s unclear how they determine this if some counts are dismissed and others have a guilty plea, but it appears. (Not in court entries contain the outcome of matters never sent to court for a defendant.) About 85 percent of all cases have only one defendant, but they are often the ones we don’t care about. Complicated white collar crimes, drug trafficking and the like often have many defendants who come and go from cases.
- Counts: This is each count for each charge brought in court, but there is no information in this table (or in the charges table) if there was nothing brought to court. It means that if you are looking at declinatins, the only charge or topic known is the original one in the lead charge and program category in the case table.
There are errors and other types of entries in this data that refer to entities other than defendants, such as search warrants and cases against properties. These can mainly be filtered out by restricting queries to specific courts: “NC” (not in court); “MG” (magistrate) and “DC” (District)
Case status
The data includes active, closed and re-opened matters. The OIG has said the the DOJ should ignore the reopened cases, since they are double-counted in the data. I believe reopened cases come from two sources: A return from an appeal, where they don’t start a new case, and cases where new evidence has emerged that means they want to reverse a dismissal. They’re pretty rare, but I haven’t really dealt with how to account for them.
Declination codes (DE and ID dispositions)
Disposition reason codes for criminal declinations were simplified in 2014. They are mapped to the new ones with further explanation on page A-49 of the the file distributed as appendix_a.pdf with the LIONS data in the 2016 version. I created a lookup table mapping the new (more general) and old (more detailed) codes. There is only one translation list - all of the new codes are unique in the original lookup.
Opened in error / office error (OE disposition, or OEOE in disposition reason)
According to the documentation, these are items that really were added by mistake. USAO offices are not allowed to delete anything - this is an indication that they should be deleted. (I guess it’s possible to do something nefarious with this, but I don’t think we’d be able to get at that. There would be no public record of it.)
All work completed (CW disposition or AWCP disposition reason)
AWCP codes should refer only to miscellaneous matters only, in courts MM or MD. (Miscellaneous filings are those that are not related to a specific case filing. Instead, they’re search warrants, Rule 5 hearings , etc. (A Rule 5 hearing is a first initial appearance where you’re asked about whether you have a lawyer, etc. It comes before the detention hearing, which is the important one. ) I don’t know that we need to keep those, since there will be a complaint or indictment filed in magistrate court before this happens. They also refer to instances in which they have to get a search warrant, or file charges against property before linking it to a person or company.
Getting a chronology
The other places that use this data split it into three sections: matters that were never filed as a case; cases that were completed in magistrate court; and cases that went to district court. They appear to ignore most everything else. They come from either the court history or participant court history. To the extent possible, I have worked at the defendant level, which is how most other places deal with outcomes, including in the US Administrative Office of Courts.
Matters that were declined should have only one entry in the court history. Those that went to court, however, often have several entries, not all of which make sense. In recent years, the first entry is Not in Court , with a “NW” (New) filing, with a disposition reason of “INDT”, “INFO” or “CMPL” , referring to the charging instrument. Then there is often an entry in Magistrate court, with another NW disposition, this time referring to the indictment. Then there is the District Court entry, which is the substantive one. Many cases then have post-disposition entries, which include things like returns from appeals courts, revocation of probation, or sentencing appeals. These are supposed to be filed under a “Post-dispostion” court, but they sometimes aren’t.
One of the most confusing things to deal with (as in the courts) is superseding indictments. For completed cases, these work themselves out. The chronology built from the disposition date will get you to the latest thing that happened. But this is one reason to avoid case level processing. The superseding indictments will naturally sort to the end of the case, and you have to go through a lot to get to the actual disposition.
Dates
The filing dates come from the instruments
table, such as a complaint, an indictment or a superceding indictment. There can be empty rows for active cases when there is a sealed indictment for fugitives or the case hasn’t been made public yet. These are filled into the court history tables, I believe as a convenience. They are almost always the same dates.
The disposition dates come from the counts, I think. They are filled in on the pariticpant history and the case history.
TRAC and the US attorneys’ offices use something called the “system” date rather than the actual dates for filing and dispositions, which makes sense if they are concerned with statistics produced each year that are never revised. But that doesn’t make much sense for us – we like to know when something actually happened, not when people entered them into the system.
More details
Useful fields that you might not notice
- Each district maintains a list of codes to categorize cases in their district. Sometimes these can be quite useful. Two recent examples are the McGirt case in Oklahoma, which spawned hundreds of Indian Country prosecutions. These are coded in the district as McGirt-related case. The other is “Capitol Riots” code in the DC district. It’s worth a try to see if the kinds of cases you’re interested in might have a code that is used across districts as well. For example, every district has a code for COVID related fraud in CIVIL court, but not criminal.
- Sealed cases. I’m not sure about this, but it seems true: In some tables, if a case is sealed or redacted, a field will be shown as an asterisk. But if it’s not, it will be blank! In the participant count table, for example, there is an indicator Y/N under “sealed”. Most of this column is blank. But there are some that have “*” in them. I don’t know if they were originally “N” as well as “Y”, but it’s possible there is actually a little sealed data in there!
Other elements
Appeals and new case numbers
Cases that are appealed and sent back to District Court for a new trial are assigned NEW case numbers. I haven’t explored how to link them back to the appeals, or whether it matters. There is a table called case_relate_appeals that, I believe, links the appeal case to the new district case. But I haven’t explored this at all.
Penalties and sentences
I haven’t looked at the penalties and sentencing data very well. I know they’re two different things: penalty goes with the charge (statute), and sentencing comes from the Judgment and Commitment Order. Sentences can only be entered if there is a disposition of GT (guilty) or NC (Nolo) . These are all at the defendant (participant) level, so I haven’t really looked at it.
Processing notes
Download script
I run an R Script (download-files.R) each month to a) download all of the files into the origzips
folder, and b) create an index to them saved in the layout-lookups
folder. This index isn’t very relevant except as it relates to the location of record layouts (for new tables, if they ever occur), and for lookup tables. The record layouts for each table is contained in a README.txt file on each “disk”, so you need to know which disk a table is in if you want to review its lookups.
I created a spreadsheet containing the contents of the ZIP files – 400 files in total – with the record layouts copied from each disk’s README file. These very rarely change – the record layouts that I ended up with in this exercise in May 2022 are virtually identical to those I created in a similar exercise in 2015.
Lookup tables
The RMarkdown document called 01-create-lookups
reads and saves all of the lookup tables (the code translations) that I think we’ll use, which don’t change very often.
Most of the lookup tables come from the spreadsheet called layout-lookups/global_translations_asof2020.xlsx
. The global ones – those that are not different by district – are provided in the last disk each month as “global_LIONS.txt”, and are a series of fixed-width tables stacked on top of one another. I pulled them apart manually rather than try to process them all at once, though that’s certainly possible.
Judges and staff lists are the only parts of the lookup tables that change regularly. These are in their own tables in table_gs_judge.txt
and table_gs_staff
The DOJ started redacting the details on the names of staff a long time ago – I started a staff_edited.Rda
table with a field for the name of the staff member as we discover them, but only a couple are filled out.
Important things to remember
Charges
If a matter is declined, the ONLY charge that we know of is the lead charge, which was what they were referred (from FBI or whatever) . When we look at cases in more detail after they’ve been charged, there are often more specific ones. Example: the McGirt case that opened after the 2020 Supreme Court decision was opened as a sexual assault. They only added “sexual assault of children” in a new indictment.
Redactions
There are two reasons for redactions, not to be confused with one another.
- Almost everything in a case is redacted before the USDOJ files a charge in court or formally declines to prosecute. All of the cases that are still under investigation are, for all practical purposes, redacted. This is SUPER important because it means that there is no lead charge OR program code for anything that has been neither declined nor charged. The only thing you know is that a district has gotten a case referred to it.
- Some information is redacted by policy or because of sensitive information. These are usually sealed (temporary) or expunged (permanent) cases or charges. The problem with this is that there is no unique identifier other than the charge for counts for a defendant. Without knowing the charge, there’s no way to link them back to the charges filed in the case. I’m not sure how important this is, but it seems like a problem.
It matters more in some kinds of cases (white collar, public corruption, etc.) than others (immigration) because of the length of time it takes to bring a case. It also means that you can’t say that there has been a decline in referrals of a certain kind of case in recent years, because you literally don’t know that to be true. AND it means that our data will not match some of the statistics published by DOJ because they can include referral types and lead charges before they’re declined or charged.
As a result, this data is more easily analyzed by looking at those that have had some action taken, not on referrals, for any given topic. In DOJ parlance, this means that open “matters” are redacted, but “cases” are not.
Tables to ignore
There are a number of tables that contain no useful information for us. I’ve documented them in the spreadsheet with the record layouts.
Events
Events are really useful for looking at a more thorough history of a case / defendant, but there is no real way to generalize across districts or across US attorneys within a district. These are set by districts, and not required by USDOJ. Ignoring these except in cases when you want to look up the whole history of a case simplifies everything and also vastly reduces the amount of space required for the analysis.
Sentencing
The sentencing information could be really interesting, but I’ve not delved into it. One question I have is whether it provides anything different than the data published by the US Sentencing Commission, which is standardized and developed for analysis on sentences.
Resources / possible sources
Urban institute
They are the contractors that create the data in the Federal Justice Statistics program. https://www.urban.org/author/emily-tiry
TRAC
I’ve looked at TRAC in detail when I did some work with the data at the Howard Center. ASU has a site license for TRAC, which is sometimes easier to look at when you want to see how a case has progressed. I am still struggling with how to use the site to extract just a single case / defendant. Sue Long is our contact there - I spoke with her in the Spring of 2021 about some questions, but she was totally unhelpful. She had no good examples of people using their data, and had no good story ideas for us. She doesn’t really understand their underlying data processing ( she described things that were pretty straightforward as too complicated for me to understand) and didn’t realize what I think I found about their data: It appeared when I looked in depth at this, that they don’t go back and correct previous fiscal years – once the end of the fiscal year comes, they lock in the data. This means that any case that is still under investigation at the end of the fiscal year HAS NO PROGRAM CODE OR LEAD CHARGE!!! I was getting like twice as many cases as them for certain topics, and this was what I finally think I figured out about it. But Susan has been so unhelpful that I couldn’t verify it.
USDOJ
Mark Motivans, Bureau of Justice Statistics expert that largely oversees the Federal Justice Statistics Program. He runs a program on data that is linked across the EOUSA, the US Sentencing Commission, and several other agencies (notably NOT the FBI) – it standardizes some of the LIONS data so it can be cross-referenced. That is NOT a public dataset – you have to get a special approval to use it, and that comes with some restrictions that most journalists would not be willing to abide by. I have not spoken with him, but he could be useful. Here’s a video of him describing how to use the linked data.
Keywords
The way I’ve found the most studies about LIONS is to search academic literature for “Legal Information Office Network System” or CaseView and USDOJ. People who use the database tend to use those keywords.