Some code I played around with for reprocessing user and post positions

By: x0x7

Background and Context:

This site was started to play around with one of the initial ideas Aaron Swartz had for Reddit: that your upvotes and downvotes would customize what content you saw. This was very novel at the time, but Reddit never reached that milestone within Swartz's lifetime. To pursue that idea, posts and users on this site have positional data attached to them. The first version was very simple: user and post positions are updated any time you vote on something. On an upvote, you move closer to the post and the post moves closer to you; on a downvote, the two move apart. There is a bit more code to avoid "identity collapse," where everything would converge to a single point because people upvote more than they downvote. Proximity feeds into the sorting algorithm here. The standard sort is called "custom"; it lets users control the weights used in the sort algorithm, and proximity is one factor they can either maximize or minimize.
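The vote-driven update described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual code: the function name `apply_vote`, the straight-line move, and the `rate` step size are all assumptions, and the "identity collapse" safeguard is left out.

```python
import numpy as np

def apply_vote(user_pos, post_pos, vote, rate=0.05):
    """Return updated (user_pos, post_pos) after one vote.

    vote is +1 for an upvote and -1 for a downvote.
    rate is an assumed step size, not a value taken from the site.
    """
    user_pos = np.asarray(user_pos, dtype=float)
    post_pos = np.asarray(post_pos, dtype=float)
    delta = post_pos - user_pos        # vector pointing from the user to the post
    step = rate * vote * delta         # upvote: toward each other; downvote: apart
    return user_pos + step, post_pos - step
```

With this rule an upvote scales the user-post distance by `1 - 2 * rate` and a downvote scales it by `1 + 2 * rate`, which also shows why a site with mostly upvotes would slowly pull everything together without an extra safeguard.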

But I wasn't sure the original method developed for the site was giving the best results, so I decided to try a bit of machine learning to reprocess all the data we have so far. It did improve the positions quite a bit, and I think those improvements have held even though the system has kept applying the old method to votes cast since I ran this.

The method I used treats the existing positions as model parameters. A gradient function drives upvoted posts and users closer together, and downvoted posts and users farther apart. Because many users don't downvote, and I needed some balance, I also estimated when each post a user didn't vote on would have been visible to them. If the user cast an upvote around the same time an ignored post would have been visible, I marked the ignored post as a minor downvote. This somewhat works because this is a small site where it's easy to see every post. But a bit of contra-sample noise can't hurt anyway.
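To make the idea of "positions as parameters" concrete, here is a small sketch of one possible objective and its gradient. It is not the code I ran: the original used autograd on a GPU, while this version writes the gradient by hand in NumPy, and the squared-distance pull, the hinge-style push out to a `margin`, and the names `loss_and_grads` and `fit` are all assumptions made for illustration. An inferred "ignored" pair would simply get a small negative weight (say -0.2, also an assumed value).

```python
import numpy as np

def loss_and_grads(users, items, u_idx, i_idx, weight, margin=1.0):
    """Loss and gradients for a batch of votes.

    weight > 0 pulls the pair together (upvote); weight < 0 pushes the pair
    apart until it is at least `margin` away (downvote or inferred downvote).
    """
    diff = users[u_idx] - items[i_idx]              # (n_votes, dim)
    dist = np.linalg.norm(diff, axis=1) + 1e-9      # avoid division by zero
    gap = np.maximum(0.0, margin - dist)            # how far inside the margin
    pull = weight > 0
    per_vote = np.where(pull, weight * dist**2, -weight * gap**2)
    # Derivative of each per-vote loss with respect to the distance.
    dl_dd = np.where(pull, 2.0 * weight * dist, 2.0 * weight * gap)
    g = (dl_dd / dist)[:, None] * diff              # gradient w.r.t. the user row
    g_users = np.zeros_like(users)
    g_items = np.zeros_like(items)
    np.add.at(g_users, u_idx, g)                    # accumulate repeated indices
    np.add.at(g_items, i_idx, -g)                   # items get the opposite sign
    return per_vote.sum(), g_users, g_items

def fit(users, items, u_idx, i_idx, weight, lr=0.05, steps=200):
    """Plain gradient descent starting from the existing positions."""
    users, items = users.copy(), items.copy()
    for _ in range(steps):
        _, g_u, g_i = loss_and_grads(users, items, u_idx, i_idx, weight)
        users -= lr * g_u
        items -= lr * g_i
    return users, items
```

The push term is capped at a margin so that downvoted pairs stop moving once they are far enough apart; without some cap, a negative weight on squared distance would send positions off to infinity.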

There are three main programs involved. The first extracts data from a database into a dataframe that is ready for machine learning. The second optimizes user and post positions in a 10D Hilbert space using autograd and gradient descent; this runs on a GPU. The last applies principal component analysis (PCA) to the 10D result and pushes it back into a database format.
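For the last step, PCA on the optimized positions can be sketched with a plain SVD. This is an illustration under assumptions: the function name `pca_project` is made up, and whether the real program keeps all 10 components or only some of them is not stated here, so `n_components` is left optional.

```python
import numpy as np

def pca_project(positions, n_components=None):
    """Rotate positions onto their principal axes, ordered by variance.

    positions is an (n_points, dim) array. If n_components is given, only the
    leading axes are kept; otherwise the result is a pure rotation that
    preserves all pairwise distances.
    """
    centered = positions - positions.mean(axis=0)
    # Rows of vt are the principal axes, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    if n_components is not None:
        vt = vt[:n_components]
    return centered @ vt.T
```

Users and posts would presumably be stacked into one array before calling this, so that both end up in the same rotated coordinate system.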

import sqlite3
import pandas as pd
import numpy as np
import gc
import shutil
import os
import random

def prepare_data():
    print("Loading SQLite Data in read-only mode...")
    conn_data = sqlite3.connect("file:data.db?mode=ro", uri=True)
    conn_matrix = sqlite3.connect("file:matrixdb.db?mode=ro", uri=True)

    # Load items (posts and comments)
    print("Loading items...")
    df_tags = pd.read_sql_query("SELECT tag as id, time FROM tag", conn_matrix)
    df_comments = pd.read_sql_query("SELECT comment as id, time FROM comment", conn_data)
    df_items = pd.concat([df_tags, df_comments]).drop_duplicates(subset=['id']).reset_index(drop=True)

    # Load votes
    print("Loading votes...")
    df_votes = pd.read_sql_query("SELECT user, post as item, vote, time FROM votestate", conn_data)

    # Clean up unneeded connections
    conn_data.close()
    conn_matrix.close()

    # Filter votes to only include known items (if any are missing)
    df_votes = df_votes[df_votes['item'].isin(df_items['id'])]

    # Map Users and Items to consecutive Integer IDs
    print("Mapping IDs...")
    unique_users = df_votes['user'].unique()
    user2idx = {u: i for i, u in enumerate(unique_users)}
    idx2user = {i: u for u, i in user2idx.items()}

    unique_items = df_items['id'].unique()
    item2idx = {item: i for i, item in enumerate(unique_items)}
    idx2item = {i: item for item, i in item2idx.items()}

    df_votes['u_idx'] = df_votes['user'].map(user2idx)
    df_votes['i_idx'] = df_votes['item'].map(item2idx)
    df_items['i_idx'] = df_items['id'].map(item2idx)

    print(f"Total users: {len(unique_users)}")
    print(f"Total items: {len(unique_items)}")
    print(f"Total explicit votes: {len(df_votes)}")

    # Inferred neutral votes (0)
    # For each user vote, sample 1 item that was posted around the same time (within +/- 3 days)
    # We will do this via a time-sorted array of items
    print("Inferring neutral votes based on time proximity...")
    df_items = df_items.sort_values('time').reset_index(drop=True)
    item_times = df_items['time'].values
    item_idxs = df_items['i_idx'].values

    # Create a set of (u_idx, i_idx) for fast lookup
    existing_votes = set(zip(df_votes['u_idx'], df_votes['i_idx']))

    neutral_votes = []

    # 3 days in seconds = 3 * 24 * 3600 = 259200
    TIME_WINDOW = 259200

    # To speed up, we'll iterate through votes and randomly pick an item within the time window
    # We'll use np.searchsorted

    for row in df_votes.itertuples():
        u = row.u_idx
        v_time = row.time # The time the *vote* happened.
        # Wait, the prompt says "if a user voted on one...
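The listing is cut off inside the loop, so what follows is only a sketch of how the time-window sampling with `np.searchsorted` might look, not a reconstruction of the missing code. It assumes the window is centered on the vote time, that `item_times` is sorted ascending (as it is after the `sort_values('time')` call), and that `seen` is a set of `(u_idx, i_idx)` pairs like `existing_votes`. The helper name `sample_neutral` is hypothetical.

```python
import random
import numpy as np

def sample_neutral(user, vote_time, item_times, item_idxs, seen,
                   window=259200, rng=random):
    """Pick one item posted within +/- `window` seconds of `vote_time`
    that `user` has not voted on. Returns the item index, or None.

    item_times must be sorted ascending and aligned with item_idxs.
    """
    # Binary search for the slice of items inside the time window.
    lo = np.searchsorted(item_times, vote_time - window, side="left")
    hi = np.searchsorted(item_times, vote_time + window, side="right")
    # Drop anything the user already voted on.
    candidates = [int(i) for i in item_idxs[lo:hi] if (user, int(i)) not in seen]
    return rng.choice(candidates) if candidates else None
```

Inside the loop this would be called once per explicit vote, appending `(u, picked_item)` to `neutral_votes` whenever the result is not `None`. Filtering the whole window for every vote is fine on a small site; a larger dataset would probably want a few random retries instead of building the full candidate list.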
