LSTM-Based Question Answering on News Articles

sparshrestha/NewsQA-LSTM: LSTM-based Question Answering system on News Articles. Includes pipeline for data ingestion, BiLSTM retriever, and LSTM+attention reader with citations (URL, Headline, Date). Part of a coursework at Kathmandu University.

Folders and files

week0_news_scraper

week1_ingestion_cleaning

week2_bilstm_encoder

week3_lstm_attention

week4_inference

.gitignore

Group_6_Mini_Project_Slides.pdf

LICENSE

README.md

main.py

requirements.txt

NewsQA-LSTM

A work-in-progress project to build a Question Answering system for news articles using LSTM-based models.

This system will:

Ingest & clean hundreds of news articles (with metadata).

Retrieve relevant passages using a BiLSTM-based retriever + FAISS/ScaNN index.

Read & extract short answers with an LSTM + attention reader.

Return answers with confidence and source citations (URL, headline, date).

Fall back to "not found" when confidence is low.

Training Corpus

Source: Crawl news articles from allowed RSS feeds / APIs / websites.

Size: Hundreds of articles → at least 5,000 unique passages after cleaning.

Preprocessing:

Normalize Unicode & remove boilerplate text (ads, nav, etc.).

Split into 200–400 token passages with ~50 token overlap.

Deduplicate near-identical passages using shingling + MinHash (target ≥10% reduction).

Metadata: Each passage stores publisher, URL, headline, date, and detected entities.
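The splitting and deduplication steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's code: the function names, the 300-token window, the 5-token shingles, the 64 hash functions, and the 0.9 similarity threshold are all hypothetical choices made for this example.

```python
import hashlib
from typing import List, Set


def split_passages(tokens: List[str], size: int = 300, overlap: int = 50) -> List[List[str]]:
    """Slide a window of `size` tokens; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    passages = []
    for start in range(0, len(tokens), step):
        passages.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the article
    return passages


def shingles(tokens: List[str], k: int = 5) -> Set[str]:
    """Return the set of k-token shingles (one shingle if the passage is shorter than k)."""
    last_start = max(1, len(tokens) - k + 1)
    return {" ".join(tokens[i:i + k]) for i in range(last_start)}


def minhash_signature(shingle_set: Set[str], num_hashes: int = 64) -> List[int]:
    """For each salted hash function, keep the minimum hash value over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(passages: List[List[str]], threshold: float = 0.9) -> List[List[str]]:
    """Keep a passage only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for passage in passages:
        sig = minhash_signature(shingles(passage))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(passage)
            kept_sigs.append(sig)
    return kept
```

Note that the final window of an article can be shorter than the 200-token lower bound, so a real pipeline would need a rule for short tails (merge or drop). The pairwise comparison in `deduplicate` is quadratic in the number of passages; at 5,000+ passages, banding the signatures with LSH would be the usual next step.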

Setup and Usage

git clone git@github.com:sparshrestha/NewsQA-LSTM.git

cd NewsQA-LSTM

python -m venv venv

source venv/bin/activate

pip install -r requirements.txt

Tech Stack

Python 3.10+

PyTorch (for BiLSTM & reader)

FAISS or ScaNN (for ANN search)

BeautifulSoup / Newspaper3k (for news crawling & cleaning)

Retrieval Model (BiLSTM)

Embeddings: Word embeddings (GloVe / fastText) + optional char-CNN.

Encoder: BiLSTM → pooling (mean/max/attentive) → fixed vector.

Training: Contrastive / triplet loss with (query, positive, negative) triplets.

Index: FAISS or ScaNN ANN index, storing passage IDs for citations.

Evaluation:

Recall@20: Fraction of queries with relevant passage in top-20.

MRR: Mean reciprocal rank of correct passage.
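The triplet objective and the two retrieval metrics above can be written out in a few lines. The sketch below uses plain Python lists for clarity, whereas the project itself lists PyTorch; the function names, the cosine distance, and the 0.2 margin are assumptions made for this example rather than values taken from the repository.

```python
import math
from typing import List, Sequence, Set


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; treated as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def triplet_loss(query: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Zero once the positive is closer to the query than the negative by `margin`."""
    return max(0.0, cosine_distance(query, positive)
               - cosine_distance(query, negative) + margin)


def recall_at_k(ranked: List[List[str]], relevant: List[Set[str]], k: int = 20) -> float:
    """Fraction of queries with at least one relevant passage ID in the top k."""
    hits = sum(1 for ids, rel in zip(ranked, relevant)
               if any(pid in rel for pid in ids[:k]))
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: List[List[str]], relevant: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant passage (0 if none is retrieved)."""
    total = 0.0
    for ids, rel in zip(ranked, relevant):
        for rank, pid in enumerate(ids, start=1):
            if pid in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)
```

Here `ranked[i]` is the retriever's ordered list of passage IDs for query `i`, and `relevant[i]` is the set of IDs judged correct for that query. Both metrics assume at least one query.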

Reader Model (LSTM + Attention)

Architecture: BiLSTM + attention over passage conditioned on question.

Output: Start and end span predictions (softmax).

Training:

Pretrain on SQuAD-style dataset (Wikipedia).

Fine-tune via distant supervision on news (when gold answers appear in passages).

Evaluation Metrics:

EM (Exact Match): % answers exactly equal to gold.

F1: Overlap of predicted vs gold answer tokens.
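As a rough illustration of the two reader metrics, the sketch below follows the normalization commonly used for SQuAD-style evaluation (lowercase, strip punctuation and English articles, collapse whitespace). The function names are chosen for this example, and the project's own evaluation script may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and the articles a/an/the, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "prime minister of Nepal" against the gold answer "the prime minister" is not an exact match, but it scores precision 0.5 and recall 1.0, giving an F1 of about 0.67.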

Inference Pipeline

Preprocess input question (tokenization, entities, date hints).

Retrieve top-k=20 passages via retriever.

Re-rank with cross-encoder (question ⊕ passage), keep top-m=5.

Run reader on top-m passages to extract candidate answer spans.

Select highest-confidence span → return answer + confidence + (URL, headline, date).

If confidence is low → fall back to "not found".
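The last steps of the pipeline (span selection, citation, and the low-confidence fallback) might look roughly like the sketch below. Everything named here is hypothetical: the `Passage` dataclass, the `best_span` and `answer` helpers, scoring a span as the product of its start and end probabilities, the 15-token span limit, and the 0.5 threshold are assumptions for illustration, not the repository's actual interface.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Passage:
    tokens: List[str]
    url: str
    headline: str
    date: str


def best_span(start_probs: Sequence[float], end_probs: Sequence[float],
              max_len: int = 15) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_probs[i] * end_probs[j], i <= j."""
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


def answer(passages: Sequence[Passage],
           span_probs: Sequence[Tuple[Sequence[float], Sequence[float]]],
           threshold: float = 0.5) -> dict:
    """Pick the highest-scoring span over all passages, or fall back to 'not found'."""
    top = None  # (score, answer text, source passage)
    for passage, (start_probs, end_probs) in zip(passages, span_probs):
        i, j, score = best_span(start_probs, end_probs)
        if top is None or score > top[0]:
            top = (score, " ".join(passage.tokens[i:j + 1]), passage)
    if top is None or top[0] < threshold:
        return {"answer": "not found",
                "confidence": top[0] if top else 0.0,
                "source": None}
    score, text, passage = top
    return {"answer": text,
            "confidence": score,
            "source": (passage.url, passage.headline, passage.date)}
```

Here `span_probs[i]` holds the reader's start and end softmax outputs for `passages[i]`, one probability per token. The threshold would normally be tuned on held-out questions instead of being fixed up front.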

Topics: machine-learning, natural-language-processing, news, neural-network, artificial-intelligence, lstm, bidirectional-lstm

License

MIT license
