I recently completed a project for Human Rights First, a non-profit dedicated to providing legal representation for refugees. Their requirement was to take write-ups of appeal documents and extract useful information from them, such as the name of the judge, the court location, and the defendant's country of origin. This write-up is the story of my path to making usable code rather than a strict technical specification; I have linked the repository at the bottom of the article if you just want to see the code.
In classic computer science fashion, the trivially easy parts ended up being ridiculously hard, while the sections that would seem impossible to non-programmers ended up being fairly easy. Specifically, figuring out the code to convert and scrape the documents was not so bad, while figuring out how to run that code somewhere that was not my laptop turned out to be much harder.
Let's start by looking at what our customer wanted, and how we translated that into technical requirements. This is one of the softer parts of computer science, where you need feedback from key stakeholders, end-users, and project managers to figure out what they want, what can be done, and any important constraints. At this point, most of the people involved in the project were non-technical, so I made sure to taboo words like "API". If you have ever been in a conversation where two people are speaking a language you don't speak, you can understand the frustration of listening to a computer scientist use technical terms.
What the customer wanted was to translate thousands of legal documents from past refugee cases into usable information. There have been so many refugee cases that reading and remembering all of them would be infeasible. The customer wanted the name of the judge from each case, to figure out whether certain judges were more likely to reject cases, as well as the court location, to see whether the same was true of certain courts. In an ideal world, the answer to both questions would be "of course not", but it is good to check whether you live in that particular world.
The first problem was that the documents had all been uploaded to Scribd, instead of being available in a simple directory. This was a problem because Scribd throttles your ability to download documents programmatically. Specifically, if you go to a Scribd web page containing a document and inspect the document element on the page, you will notice that the document is composed of snippets of special characters and blank images. I reached out to Scribd to see if I could get the documents in bulk, but they said no. I also tried to find out how the client had initially obtained those documents, and who had uploaded them to Scribd, but that piece of organizational knowledge had been lost. We ultimately just had to click many, many times to get all of the documents we wanted, so that was unfortunate.
The next problem was that all of the documents were in PDF format. The human eye cannot tell the difference between a PDF displaying text and a text document itself. However, PDFs and plain text are very different to computers, with PDFs being saved as formatted byte strings. To convert all the PDFs to text, we needed several third-party libraries, including Poppler (to render PDF pages as images) and Pytesseract (to OCR those images into text). I did not realize it at the time, but these libraries are huge and take a long time to run, which created our first blocker. I was working locally at the time, so I did not foresee the problem this would present later when we moved to AWS.
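The conversion step boils down to a small amount of code. Here is a minimal sketch using pdf2image (a common Python wrapper around Poppler) and pytesseract; the function name and the dpi value are my own illustrative choices, not taken from the project:

```python
def pdf_to_text(pdf_path, dpi=300):
    """Render every page of a PDF and OCR the result into one text blob."""
    # Imports live inside the function because both libraries depend on
    # system binaries (poppler-utils and tesseract-ocr) being installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

Short as it looks, this is exactly the code path that turned out to be heavy: Poppler and Tesseract are large native dependencies, and OCR is slow.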
Finally, running code on my laptop is all well and good, but we needed a working API in the cloud so that I would not have to keep my laptop running forever on the chance that my code needed to run. To solve this problem we initially used Amazon Elastic Beanstalk, an application-hosting service. Getting our API up and running was a good experience; I have to commend Amazon for having very clear documentation on how to use their services. However, we started running into problems almost immediately when testing whether we could send a real appeal document to our API. A very small document ran with no issue, a medium-sized document resulted in a memory error, and a large document resulted in a gateway timeout. Medium-sized documents were too fat, while large documents were too long.
I managed to make several optimizations to our code, which made the memory errors go away. Specifically, I had been converting the entire PDF in one function call, which meant every rendered page was held in memory before any of it was converted to text. Algorithms people would call this O(N) space complexity, because the space required by the program grows in proportion to the size of the input. Technical interviewers would call this "Great try, but you're not working here". By converting a single page of the PDF at a time, I reduced the space complexity to O(1), meaning the program needs only a constant amount of memory no matter how long the document is.
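The shape of that fix can be sketched in a few lines. The `render_page` and `ocr` functions below are stand-ins for the Poppler and Tesseract calls, so the memory behaviour is easy to see without the heavy dependencies:

```python
def render_page(page_number):
    return f"image-{page_number}"  # stand-in for one rendered page image

def ocr(image):
    return f"text from {image}"   # stand-in for the pytesseract OCR call

def convert_all_at_once(num_pages):
    # Original approach, O(N) space: every rendered page sits in a list
    # before any OCR happens, so memory grows with document length.
    images = [render_page(n) for n in range(num_pages)]
    return "\n".join(ocr(image) for image in images)

def convert_page_by_page(num_pages):
    # Fixed approach, O(1) image memory: render one page, OCR it, and
    # let it be garbage-collected before touching the next page.
    parts = []
    for n in range(num_pages):
        image = render_page(n)  # only one page image alive at a time
        parts.append(ocr(image))
    return "\n".join(parts)
```

Both functions produce identical output; only the peak memory during rendering differs, which is what matters when the page images are megabytes each.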
That left the problem of long documents. I had a lot of trouble trying to figure out the gateway timeout errors: it seemed like no matter what setting or timeout I applied, the API would time out after 15 seconds. I got bad but clarifying evidence that the instance size was throttling me when I increased it; after that, the API timed out after 60 seconds instead. When I tested the code on a 10-page document, both locally and in Google Colab, the total conversion time was about 120 seconds. I would have needed a really big instance to handle every document correctly, so I needed an alternate solution.
AWS Lambda is a serverless function host that scales in response to demand. I decided to use it instead, since I did not really need a full website, only a single responsive function running somewhere. Additionally, Lambda functions can run for up to 15 minutes, which would remove the cause of the 504 errors.
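For anyone unfamiliar with Lambda, the whole "API" collapses into one entry-point function. This is a hedged sketch, assuming the PDF arrives base64-encoded in the event body; the response field names are placeholders, not the project's actual schema:

```python
import base64
import json

def lambda_handler(event, context):
    # Lambda invokes this function with the request payload as `event`.
    pdf_bytes = base64.b64decode(event["body"])  # decode the uploaded PDF
    # ...run the page-by-page conversion and scraping pipeline here...
    fields = {"judge": None, "court_location": None, "pdf_size": len(pdf_bytes)}
    return {"statusCode": 200, "body": json.dumps(fields)}
```

Since Lambda bills per invocation and spins instances up on demand, a function that runs only when a document actually arrives is a much better fit than a web server idling on the chance of a request.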
Ultimately, the customer was impressed with our product and is playing around with it at the time of this writing. I'm sure they will discover several very interesting ways to break my code, which will be appropriately embarrassing. Once I've fixed, and cursed at, our system, we will have made our justice system slightly more efficient, which is awesome!