sparshrestha/NewsQA-LSTM: LSTM-based Question Answering system on news articles. Includes a pipeline for data ingestion, a BiLSTM retriever, and an LSTM + attention reader with citations (URL, headline, date). Part of coursework at Kathmandu University.
NewsQA-LSTM
A work-in-progress project to build a Question Answering system for news articles using LSTM-based models.
This system will:
Ingest & clean hundreds of news articles (with metadata).
Retrieve relevant passages using a BiLSTM-based retriever + FAISS/ScaNN index.
Read & extract short answers with an LSTM + attention reader.
Return answers with confidence and source citations (URL, headline, date).
Fall back to "not found" when confidence is low.
Training Corpus
Source: Crawl news articles from allowed RSS feeds / APIs / websites.
Size: Hundreds of articles → at least 5,000 unique passages after cleaning.
Preprocessing:
Normalize Unicode & remove boilerplate text (ads, nav, etc.).
Split into 200–400 token passages with ~50 token overlap.
Deduplicate near-identical passages using shingling + MinHash (target ≥10% reduction).
Metadata: Each passage stores publisher, URL, headline, date, and detected entities.
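The preprocessing steps above (overlapping passages and near-duplicate removal) can be sketched as follows. This is a minimal illustration using only the standard library, not the repository's actual code. The function names, the 5-word shingle size, the 64 hash functions, and the 0.8 similarity threshold are all assumptions made for the example.

```python
import hashlib
import re


def split_passages(tokens, size=300, overlap=50):
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    passages = []
    for start in range(0, len(tokens), step):
        passages.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the article
    return passages


def shingles(text, k=5):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            (int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode("utf-8"), digest_size=8).digest(),
                "big")
             for s in shingle_set),
            default=0,  # empty text: constant signature
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(passages, threshold=0.8):
    """Keep a passage only if it is not a near-duplicate of one already kept.

    Pairwise comparison is O(n^2); a larger corpus would typically add
    LSH banding on top of the signatures.
    """
    kept, kept_sigs = [], []
    for text in passages:
        sig = minhash_signature(shingles(text))
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(text)
        kept_sigs.append(sig)
    return kept
```

With `size=300` and `overlap=50`, a 700-token article yields three passages, and consecutive passages share 50 tokens.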
Setup & Usage
git clone git@github.com:sparshrestha/NewsQA-LSTM.git
cd NewsQA-LSTM
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Tech Stack
Python 3.10+
PyTorch (for BiLSTM & reader)
FAISS or ScaNN (for ANN search)
BeautifulSoup / Newspaper3k (for news crawling & cleaning)
Retrieval Model (BiLSTM)
Embeddings: Word embeddings (GloVe / fastText) + optional char-CNN.
Encoder: BiLSTM → pooling (mean/max/attentive) → fixed vector.
Training: Contrastive / triplet loss with (query, positive, negative) triplets.
Index: FAISS or ScaNN ANN index, storing passage IDs for citations.
Evaluation:
Recall@20: Fraction of queries with relevant passage in top-20.
MRR: Mean reciprocal rank of correct passage.
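The two retrieval metrics above can be computed as in the sketch below. It assumes each query comes with a ranked list of passage IDs and a set of relevant IDs. The function names and this data layout are illustrative, not taken from the repository.

```python
def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of queries with at least one relevant passage in the top k."""
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        if any(pid in relevant for pid in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids)


def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """Average of 1/rank of the first relevant passage (0 if none is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, pid in enumerate(ranked, start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)


# Three toy queries: relevant passage at rank 2, at rank 1, and not retrieved.
ranked = [["p3", "p1", "p7"], ["p2", "p9"], ["p4", "p5"]]
relevant = [{"p1"}, {"p2"}, {"p8"}]
print(recall_at_k(ranked, relevant, k=2))      # 2 of 3 queries hit
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1 + 0) / 3
```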
Reader Model (LSTM + Attention)
Architecture: BiLSTM + attention over passage conditioned on question.
Output: Start and end span predictions (softmax).
Training:
Pretrain on SQuAD-style dataset (Wikipedia).
Fine-tune via distant supervision on news (when gold answers appear in passages).
Evaluation Metrics:
EM (Exact Match): Percentage of predicted answers exactly equal to the gold answer.
F1: Token-level overlap between predicted and gold answers (harmonic mean of precision and recall).
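A minimal sketch of the EM and F1 metrics is shown below. It assumes the normalization commonly used for SQuAD-style evaluation (lowercasing, stripping punctuation and English articles). The README does not say which normalization the project applies, so treat that step as an assumption.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "Kathmandu University" against the gold answer "Kathmandu" gives precision 1/2 and recall 1, so F1 is 2/3 while EM is 0.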
Inference Pipeline
Preprocess input question (tokenization, entities, date hints).
Retrieve top-k=20 passages via retriever.
Re-rank with cross-encoder (question ⊕ passage), keep top-m=5.
Run reader on top-m passages to extract candidate answer spans.
Select highest-confidence span → return answer + confidence + (URL, headline, date).
If confidence is low, fall back to "not found".
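The last steps of the pipeline (span selection, citation, and the low-confidence fallback) might look like the sketch below. It assumes the reader has already produced per-token start and end probabilities for each re-ranked passage. The 0.5 threshold, the 15-token span limit, the dictionary layout, and the example URL are hypothetical, since the README does not specify them.

```python
def best_span(start_probs, end_probs, max_len=15):
    """Return (start, end, score) maximizing start_probs[i] * end_probs[j], i <= j."""
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


def answer(candidates, threshold=0.5):
    """Pick the highest-confidence span across passages, or fall back to 'not found'."""
    best_result = None
    for cand in candidates:
        i, j, score = best_span(cand["start_probs"], cand["end_probs"])
        if best_result is None or score > best_result["confidence"]:
            best_result = {
                "answer": " ".join(cand["tokens"][i:j + 1]),
                "confidence": score,
                "citation": (cand["url"], cand["headline"], cand["date"]),
            }
    if best_result is None or best_result["confidence"] < threshold:
        confidence = best_result["confidence"] if best_result else 0.0
        return {"answer": "not found", "confidence": confidence, "citation": None}
    return best_result


# Hypothetical reader output for two passages.
candidates = [
    {
        "tokens": ["The", "budget", "was", "passed", "on", "Friday"],
        "start_probs": [0.05, 0.05, 0.05, 0.05, 0.1, 0.7],
        "end_probs": [0.02, 0.02, 0.02, 0.02, 0.02, 0.9],
        "url": "https://example.com/budget",  # placeholder URL
        "headline": "Budget passed",
        "date": "2024-01-05",
    },
    {
        "tokens": ["No", "comment"],
        "start_probs": [0.5, 0.5],
        "end_probs": [0.5, 0.5],
        "url": "https://example.com/other",  # placeholder URL
        "headline": "Other story",
        "date": "2024-01-06",
    },
]
result = answer(candidates)
print(result["answer"], result["citation"])
```

Scoring a span as the product of its start and end probabilities is one common choice. A stricter threshold trades fewer wrong answers for more "not found" responses.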
License
MIT license