Search R Questions on Stack Overflow with Python
Build and deploy a knowledge base search app to Jina AI Cloud in under 50 lines of code and 10 minutes. Here is how you do it.
In this article I'm deploying a simple Jina app to the Jina AI Cloud. This app allows users to input a text query and then retrieve question-answer pairs, where the question semantically matches the given query. It's a pretty straightforward search app, and I want to use it to demonstrate how easy it is to take such apps from zero to ready-to-serve in 10 minutes.
Wait…for R? Not Python?! Well, the code itself is in Python. Only the dataset is related to R questions. I originally tried the Python questions dataset, but its CSV file gave me parsing trouble, and even wrapping things in try...excepts didn't help much, so I decided to spend time on a dataset that didn't have such problems. The R dataset was next on the list and it works fine.

Our tech stack
To build our search app we'll use:
DocArray
Everything going in or out of a Jina Flow has to be a Document (Documents are Jina's primitive data type). This comes from the DocArray package, which is ideal for working with unstructured data.
Jina
Jina is a framework that lets us build cross-modal and multi-modal applications on the cloud.
In short we'll use a Jina Flow and some Executors (from Executor Hub) to process and create vector embeddings for our Stack Overflow questions.
Jina AI Cloud
Jina AI Cloud is a free hosting platform for Jina applications. With it we can index and query data without using any of our own compute.
How it works
Build & Deploy
We first build a Jina Flow and deploy it to Jina AI Cloud. Later we'll use the Jina Client to send data for indexing and searching.
Write the Flow
To do this we simply need our Flow in YAML format:
jtype: Flow
with:
  protocol: grpc
executors:
  - name: encoder
    uses: jinahub+docker://SpacyTextEncoder/
    uses_with:
      model_name: 'en_core_web_md'
    jcloud:
      resources:
        memory: 8G # encoding is hungry. add more memory
  - name: indexer
    uses: jinahub+docker://AnnLiteIndexer/
    uses_with:
      n_dim: 300 # model uses this many dimensions
    uses_metas:
      workspace: workspace
    jcloud:
      capacity: on-demand
      resources:
        memory: 8G # you can never have too much memory!
You can see two Executors in our Flow, namely:
- SpacyTextEncoder — uses spaCy to create vector embeddings for our questions.
- AnnLiteIndexer — stores and retrieves our embeddings and metadata.
In short, each Executor takes our questions as input, processes them, and passes them along as output. We're wrapping each one in jinahub+docker://... so we can pull them from Executor Hub (Jina's Executor "app store") and run them in Docker inside our Flow.
Run:
jina export flowchart flow.yml flow.svg
Our Flow looks like this:
We also pass some extra parameters about which model we want to use for embedding, resources, etc.
Log in to Jina AI Cloud
To deploy a Flow to Jina Cloud, you'll need to create a Jina AI Cloud account with your email, Google, or GitHub login.
Deploy the Flow
We can deploy the Flow to the cloud with:
jina cloud deploy flow.yml
Note: If you have trouble deploying, you can view more verbose output with --loglevel=DEBUG:

jina cloud --loglevel=DEBUG deploy flow.yml
After running the command it’ll take a few minutes to deploy and then return something like:
╭───────────────────── 🎉 Flow is available! ──────────────────────╮
│ │
│ ID 8412cf3e8b │
│ Endpoint(s) grpcs://8412cf3e8b.wolf.jina.ai │
│ Dashboard https://dashboard.wolf.jina.ai/flow/8412cf3e8b │
│ │
╰──────────────────────────────────────────────────────────────────╯
Make a note of your Endpoint (in our case grpcs://8412cf3e8b.wolf.jina.ai). You'll need it later.
Indexing the data
Jina Flows use Documents and DocumentArrays for input and output. So we need to put our questions into a DocumentArray.
First let’s see what fields we have to work with, using head -n 1 Questions.csv (from our R dataset):
Id,OwnerUserId,CreationDate,Score,Title,Body
Since we’re building a search engine, we need to choose which field to search, namely the Title field. We can do this easily with the .from_csv() method from DocumentArray:
from docarray import DocumentArray
docs = DocumentArray.from_csv('Questions.csv', field_resolver={'Title': 'text'})
All other fields (Score, Body, etc.) are automatically stored in Document.tags as a Python dict.
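To make the field_resolver behavior concrete, here's a rough stdlib-only sketch of what that mapping does, using a hypothetical one-row sample in the same column layout (the real .from_csv() returns Document objects, not dicts):

```python
import csv
import io

# A one-row sample with the same header as Questions.csv (values are made up)
sample = (
    "Id,OwnerUserId,CreationDate,Score,Title,Body\n"
    "1,42,2016-01-01,5,How do I plot in R?,<p>...</p>\n"
)

def rows_to_docs(csv_text):
    """Mimic DocumentArray.from_csv with field_resolver={'Title': 'text'}:
    the resolved column becomes the text, all other columns land in tags."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = row.pop('Title')                    # resolved field -> Document.text
        docs.append({'text': text, 'tags': row})   # remaining fields -> Document.tags
    return docs

docs = rows_to_docs(sample)
print(docs[0]['text'])           # How do I plot in R?
print(docs[0]['tags']['Score'])  # 5
```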
Indexing took about 14 minutes for all 6,304,085 questions in the CSV file.
Send DocumentArray to the Flow for indexing
For this we use Jina Client. Remember to plug in your own endpoint that you got from the previous step!
from jina import Client
client = Client(host='grpcs://8412cf3e8b.wolf.jina.ai') # Your gateway from earlier
client.index(docs, show_progress=True)
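With millions of Documents, the client doesn't send everything in one request; as I understand the Jina client API, you can tune this with a request_size argument to index() (check the docs for your version). Conceptually, the batching looks like this:

```python
def batched(seq, size):
    """Yield successive slices of `seq` with at most `size` items each --
    roughly what the Jina Client does internally when you set request_size."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# Ten stand-in question titles, split into requests of four
titles = [f"question {i}" for i in range(10)]
print([len(chunk) for chunk in batched(titles, 4)])  # [4, 4, 2]
```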
Searching the data
Create search Document
Since everything going in and out of a Flow is a Document or DocumentArray, we’ll need to wrap our search term in a Document:
from docarray import Document
search_doc = Document(text='statistic visualization')
Pass the search Document to the Flow
Once again, we can use the Jina Client:
response = client.search(search_doc)
View the results
Jina returns yet another DocumentArray as the result (as I said, it’s Documents and DocumentArrays for everything). This consists of a single Document with several Document.matches. We can see the text (i.e. the question title) of these matches with:
print(response[0].matches.texts)
Given the query term statistic visualization we get the following output:
['Hotellings statistic', 'Create multivariate similarity graph', 'How to vectorize extracting significant predictor variables?', 'visualization of correlation between replicates', 'R How to visualize this categorical percentage data?', 'visualizing statistical test results with ggplot2', 'Calculate Tanimoto coefficient', 'multivariate regression', 'Visualize aggregate data using ggplot2', 'How to calculate marginal probabilities for generating correlated binary variables']
Or in a more readable way:
- Hotellings statistic
- Create multivariate similarity graph
- How to vectorize extracting significant predictor variables?
- visualization of correlation between replicates
- R How to visualize this categorical percentage data?
- visualizing statistical test results with ggplot2
- Calculate Tanimoto coefficient
- multivariate regression
- Visualize aggregate data using ggplot2
- How to calculate marginal probabilities for generating correlated binary variables
More user-friendly
By running our Flow on Jina AI Cloud it’s already most of the way there when it comes to production readiness (depending on your use case). The main thing to do next would be to integrate a nice frontend so users can interact directly.
For inspiration, you can see my Streamlit frontend folder in the project’s repo.
Essentially it acts as a wrapper for:
- Jina Client — retrieves matching questions given a search term.
- SQLite — retrieves matching answers based on the questions that come up.
🎉 That's it!
...Or is it?
This is running on gRPC. How can we use a RESTful interface?
I chose to run on gRPC because it’s more efficient than HTTP. That said, Jina also supports HTTP and WebSocket gateways. To deploy a RESTful Flow:
- Set protocol to http in your Flow YAML before deploying.
- Set a lower request_size to compensate for HTTP's lower efficiency.
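Once the Flow speaks HTTP, a search is just a POST with a JSON body. As a hedged sketch (the endpoint URL is a placeholder, and the body shape below reflects my understanding of the Jina HTTP gateway, so verify against the docs for your version):

```python
import json

# Placeholder endpoint -- swap in your own Flow's HTTPS address
endpoint = "https://<your-flow-id>.wolf.jina.ai/search"

# The request body: a list of Document-like dicts under "data"
payload = {"data": [{"text": "statistic visualization"}]}
body = json.dumps(payload)

print(body)
# To actually send it you could use, e.g., the requests package:
#   requests.post(endpoint, json=payload).json()
```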
But what about the answers?
I decided for this quick example to just store the answers in a SQLite database and retrieve them via a function in the frontend. The database is stored on the same machine as my frontend code so access is easy.
I created the database using the awesome csv_to_sqlite Python package.
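The lookup itself is a few lines of stdlib sqlite3. Here's a sketch using an in-memory stand-in for the database; the table and column names mirror the dataset's Answers.csv header (ParentId links an answer to its question's Id), so adjust them if your import named things differently:

```python
import sqlite3

# Tiny in-memory stand-in for the database csv_to_sqlite would create
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Answers (Id INTEGER, ParentId INTEGER, Score INTEGER, Body TEXT)")
conn.executemany(
    "INSERT INTO Answers VALUES (?, ?, ?, ?)",
    [
        (10, 1, 3, "Use ggplot2."),
        (11, 1, 7, "Try base plot()."),
        (12, 2, 1, "Use dplyr."),
    ],
)

def answers_for_question(conn, question_id):
    """Return the answer bodies for one question, best-scored first."""
    rows = conn.execute(
        "SELECT Body FROM Answers WHERE ParentId = ? ORDER BY Score DESC",
        (question_id,),
    )
    return [body for (body,) in rows]

print(answers_for_question(conn, 1))  # ['Try base plot().', 'Use ggplot2.']
```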
Why not store the answers as <foo> or <bar> or …?
Essentially the answer to many of these questions is:
- For this example I’m only searching the question text, so the only interaction the Flow would have with the answers would be storing them in the index anyway, without embeddings.
- That would mean uploading 237 MB to the cloud. I remember the days of 1.44 MB floppy disks, so 237 MB seems like a lot to my wrinkly old self.
- So why not just put that data in a SQLite file? It seems easy enough, and there are no extra dependencies, since the sqlite3 library comes with Python anyhow.
But if we did want to process the answers…?
If we just wanted to store the answers in the index, we could write a function to embed each question’s answers in Document.tags["answers"] as a list.

If we wanted to create and search embeddings for the answers, we could create a sub-Document for each answer in a question’s Document.chunks and then decide what access_path we want to use for searching.
Either way, it would’ve meant writing a function to match questions and answers from two CSV files. Considering the trouble I’d already had with the Python Stack Overflow CSV file above, I decided I’d had enough of CSVs for one day. Maybe in a future example.
How can we get better results?
Since Stack Overflow is all about code, you may want to use TransformerTorchEncoder with a more language-specific model (like roberta-python for Python. I tried finding one for R, but it turns out a query that's just a one-letter language name doesn't play well with Hugging Face's search engine…)
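Swapping encoders is a Flow YAML change. As a sketch (the model name here is a hypothetical choice, and the parameter name and embedding size should be checked against the Executor's and model's documentation):

```yaml
executors:
  - name: encoder
    uses: jinahub+docker://TransformerTorchEncoder/
    uses_with:
      pretrained_model_name_or_path: 'huggingface/CodeBERTa-small-v1'  # hypothetical model choice
  - name: indexer
    uses: jinahub+docker://AnnLiteIndexer/
    uses_with:
      n_dim: 768  # must match the new model's embedding size
```

Note that n_dim on the indexer has to change along with the encoder, since different models output embeddings of different sizes.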
How can we use Jina to search the actual Stack Overflow, not just a dataset?
Good luck with that! You’d need to scrape the whole site, which Stack Overflow may not be too happy about. That would be a lot of work and potentially a legal headache, not to mention keeping the index up-to-date, since the actual Stack Overflow is constantly changing. You might need to invest in some serious compute!