Fighting for humanity, one line of code at a time

Ricardo Rodriguez
7 min read · May 27, 2021

Crunchwraps, Assemble!

One of the last learning milestones at Lambda School was tackling a real-world project: one with real stakeholders, managers, meetings, deadlines, a full development team, and one month to deliver the product.

My team (which democratically chose Team Crunchwrap as our official name) worked with Human Rights First, a nonprofit, nonpartisan, international human rights organization based in New York, NY; Washington, D.C.; Houston, TX; and Los Angeles, CA. HRF's goal is a simple yet challenging one: to hold the U.S. government and private companies accountable whenever they fail to respect human rights and the rule of law.

Our beloved stakeholders, Human Rights First

The main problem we attempted to solve was helping HRF tap into their least used but most promising resource: their data. The most obvious missing component was a database to store and retrieve all of this information, along with a user-friendly interface to maintain it. Our goal for this project was therefore to create a web application, or hub, that both the HRF team and their clients could easily access. This hub would be used to store and manage asylum cases, and hopefully provide useful insights to help HRF swing rulings in favor of their clients.

From a technical standpoint, the biggest challenge we foresaw was our hub's raw ability to read and interpret data submitted by the user, which was actually a series of scanned documents uploaded in PDF format. To be clear, computers are machines capable of amazing things; unfortunately, reading comprehension is not one of them, especially when we feed them choppy, misaligned, blurry, redacted, and wildly variant images of text.
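I won't walk through the OCR plumbing here, but to give you a taste of that first step, here is a minimal sketch of pulling raw text out of a scanned PDF. Note that pdf2image and pytesseract are stand-ins I'm using for illustration, not necessarily what our hub ended up running:

import pytesseract
from pdf2image import convert_from_path

def pdf_to_text(path):
    # Scanned filings have no embedded text layer, so we rasterize
    # each page and run OCR over the resulting images.
    pages = convert_from_path(path, dpi=300)
    return '\n'.join(pytesseract.image_to_string(page) for page in pages)

# text = pdf_to_text('some_case.pdf')  # hypothetical file name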

The joys of programming

At this point you might be doubting our team's ability to deliver this project on time, especially since, for the most part, we are freshly minted data scientists and web developers. But fear not: our team was the last of many who have worked on this project, which is to say a lot of the groundwork was already in place. This was both a gift and a burden; on one hand we didn't have to start from scratch, on the other we spent a whole week trying to piece together the convoluted repository left behind by previous programmers, whose comments were cryptic enough that we would have needed a cryptographer to read their code.

The structure of the application was there, but most of the individual pieces were either broken or missing. Our team had a tough hurdle to jump over, which was to consolidate all of the language/text-recognition tools previously used into one, for the sake of efficiency. We ultimately agreed to stick with spaCy's natural language processing library, and substituting it for whatever tool was already in place meant we had to rewrite virtually everything. Luckily our team was made up of about a dozen data scientists, myself included. We discussed everything that needed to be done; eventually about half of us focused on developing useful visualizations while the rest of us worked on the text recognition.
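If you haven't used spaCy before, the appeal of consolidating onto it is that every document flows through a single pipeline. A minimal sketch (the model name here is just an example, not necessarily what we deployed):

import spacy

# Load a pretrained English pipeline; the model choice is illustrative.
nlp = spacy.load('en_core_web_sm')

# OCR'd text goes in, a parsed Doc with tokens and sentences comes out,
# ready for the pattern matching shown later in this post.
doc = nlp('The court finds Respondent generally credible.')
print([token.text for token in doc])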

My tasks ended up focusing on two main features, the first being the creation of a database with everything related to immigration judges and courts. The motivation behind this feature was to let us cross-reference the information extracted from the user-submitted PDFs against some sort of universal truth table. You would think this sort of database must already exist, and my job would simply be to download it and maybe reformat it to be more useful for us, but alas, I was wrong. I had to write a scraping tool to pull information from various websites and compile it all into a friendly format that would also contain all the information we needed. A bit tedious, but honestly I quite enjoy working on this type of problem; it's like playing tug of war with my computer! I also wanted this database to be easily updatable by any HRF admin user with the click of a button, so I decided to wrap my web-scraper algorithm in a Python function which could simply be used as an endpoint attached to a simple "Update Judges" button on the user interface (more on the endpoint wiring in a moment). Here's my work of art:

def update_judges_table():"""This function pulls information from multiple url's to create atable pupulated with information on all immigration judges. Thistable is saved in the current directory as 'jedges_courts.csv'."""def get_courts(data):    courts = {i:str(j).split(' ')[-2] for i, j in 
zip(data[3]['State'],
data[3]['Circuit assignment(s)']) if
type(j) == str and i != 'District of Columbia'}
for i, j in courts.items(): if '0th' in j or '1th' in j: courts[i] = j[-5:-3] else: courts[i] = j[-4:-3] courts['District of Columbia']= 'District of Columbia Circuit' return courtsurl = 'https://en.wikipedia.org/wiki/United_States_courts_of_appeals'data = pd.read_html(url)courts = get_courts(data)def get_judges(tables): for i in tables: i.columns = i.iloc[0].values i.drop(0, inplace=True) i.reset_index(drop=True) df = tables[0] states =
df.columns[0].replace(' ',' ').replace('|', '').split()
for i in range(len(states)): if states[i].startswith('New') \ or states[i].startswith('North') \ or states[i].startswith('South') \ or states[i].startswith('Puerto'): states[i] = states[i] + ' ' + states[i + 1] states[i + 1] = '' if states[i].startswith('Northern'): states[i] = states[i] + ' ' + states[i + 2] states[i + 2] = '' while '' in states: states.remove('') tables[0] = states judges = tables[1:] for s, j in zip(states, judges): j['State'] = s df = pd.concat(tables[1:], ignore_index = True) df.rename(columns = {'Court Administrator':'court_admin', 'Immigration Judges':'judge', 'Address':'court_address', 'Court':'court_name', 'State':'court_state'}, inplace = True) df.dropna(axis = 1, inplace = True) def get_circuit(state): if state in courts:
# ...
# hundreds of lines later.... return judges_courts

Analogous to the Mona Lisa? Probably not, but it did the job, and with more time I could easily improve its performance and aesthetics. Either way, this might be a bit more informative:

First 10 judges, out of 507
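One last note on the "Update Judges" button: the endpoint wiring isn't shown above, but as a rough sketch, and assuming a FastAPI app (my assumption for illustration, not something confirmed here), hooking the function up could look like this:

from fastapi import FastAPI

app = FastAPI()

@app.post('/update_judges')
def update_judges():
    # Hypothetical route: the UI's "Update Judges" button POSTs here,
    # triggering a fresh scrape and rebuild of judges_courts.csv.
    update_judges_table()
    return {'status': 'judges table updated'}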

I was also able to pull all relevant immigration laws, which served as a filter for another one of our text-recognition functions. All of these worked as intended, which is always nice to see. I did have one last contribution in the final hours before our deadline: building the logic for a text-recognition function that aimed to extract whether the judge or court deemed the applicant, or "Respondent," trustworthy or credible. Here's how I accomplished that task:

from spacy.matcher import Matcher

# nlp is the spaCy pipeline loaded earlier.
matcher = Matcher(nlp.vocab)

# This pattern was created from the available cases under
# the redacted directory in our shared HRF google drive.
pattern = [
    # narrow scope
    [{"LOWER": "court"}, {"LOWER": "finds"},
     {"LOWER": "respondent"}, {"LOWER": "generally"},
     {"LOWER": "credible"}],
    [{"LOWER": "court"}, {"LOWER": "finds"},
     {"LOWER": "respondent"}, {"LOWER": "testimony"},
     {"LOWER": "credible"}],
    [{"LOWER": "court"}, {"LOWER": "finds"},
     {"LOWER": "respondent"}, {"LOWER": "credible"}],
    # medium scope
    [{"LOWER": "credible"}, {"LOWER": "witness"}],
    [{"LOWER": "generally"}, {"LOWER": "consistent"}],
    [{"LOWER": "internally"}, {"LOWER": "consistent"}],
    [{"LOWER": "sufficiently"}, {"LOWER": "consistent"}],
    [{"LOWER": "testified"}, {"LOWER": "credibly"}],
    [{"LOWER": "testimony"}, {"LOWER": "credible"}],
    [{"LOWER": "testimony"}, {"LOWER": "consistent"}],
    # wide scope
    [{"LOWER": "coherent"}],
    [{"LOWER": "plausible"}]
]
matcher.add('credibility', pattern)
def similar(target_phrases, file):
    # Helper: build a Matcher from the given patterns and return the
    # matching spans found in the parsed document.
    matcher = Matcher(nlp.vocab)
    matcher.add('target_phrases', target_phrases)
    return matcher(file, as_spans=True)


def get_credibility(self):
    """
    Returns whether or not the Respondent was identified as credible
    by their assigned judge / court.
    """
    pattern = [
        # narrow scope
        [{"LOWER": "court"}, {"LOWER": "finds"},
         {"LOWER": "respondent"}, {"LOWER": "generally"},
         {"LOWER": "credible"}],
        [{"LOWER": "court"}, {"LOWER": "finds"},
         {"LOWER": "respondent"}, {"LOWER": "testimony"},
         {"LOWER": "credible"}],
        [{"LOWER": "court"}, {"LOWER": "finds"},
         {"LOWER": "respondent"}, {"LOWER": "credible"}],
        # medium scope
        [{"LOWER": "credible"}, {"LOWER": "witness"}],
        [{"LOWER": "generally"}, {"LOWER": "consistent"}],
        [{"LOWER": "internally"}, {"LOWER": "consistent"}],
        [{"LOWER": "sufficiently"}, {"LOWER": "consistent"}],
        [{"LOWER": "testified"}, {"LOWER": "credibly"}],
        [{"LOWER": "testimony"}, {"LOWER": "credible"}],
        [{"LOWER": "testimony"}, {"LOWER": "consistent"}],
        # broad scope
        [{"LOWER": "coherent"}],
        [{"LOWER": "plausible"}]
    ]
    similar_cred = similar(target_phrases=pattern, file=self.doc)
    if similar_cred:
        # following code adds matches to a list,
        # may or may not be needed by front end.
        #
        # cred = []
        # for phrase in similar_cred:
        #     if phrase.text.lower() not in cred:
        #         cred.append(phrase.text.lower())
        return 'Respondent was found credible'
    return 'Respondent was not found credible'

There is no great visual for the function above, but in essence it populates a field specifying whether the Respondent was deemed credible after a user uploads a document. The code might look a bit unimpressive; that is because spaCy is a magnificent tool which does most of the heavy lifting.
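To get a feel for what the matcher catches, here's a quick illustrative run; the sentence is one I made up in the style of the redacted rulings:

# A made-up sentence in the style of the redacted rulings.
doc = nlp('The court finds Respondent generally credible.')
for span in matcher(doc, as_spans=True):
    print(span.text)  # -> court finds Respondent generally credible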

Curtain Call

Today is delivery day, and I am proud to say our team was able to ship our product with the following features:

  • Dynamic Data Visualizations
  • Structured Field Extraction
  • UI/UX
  • Case File Upload Queue
  • Bulk Case File Upload

In the future, as data starts to accumulate, I predict more visualization and data-analysis features will be added to the hub, providing even better insights into how specific details about an application and/or a Respondent affect how a specific judge will rule. The main obstacle for these improvements is, of course, the acquisition of data and the availability of past rulings; at the moment these records are incredibly difficult to acquire.

All in all, this month was challenging not only because of the complexity behind getting computers to read comprehensively, but also because every decision and change had to be reviewed and approved by our peers, supervisors, and ultimately the stakeholders (HRF). In my opinion, our ability to work in this team setting, as well as to communicate in a clear and effective manner, were by far the most important skills that led to our success.

I was lucky enough to receive feedback from my supervisors, so I plan to really level up my communication skills as well as my confidence to be the best teammate and data scientist I can be. Now with my first real-world experience in this industry in the rear-view mirror, I cannot wait for what lies ahead!

Questions? Just want to chat? Leave a comment!

Or reach me here: LinkedIn GitHub
