➴➴➴Æ🜔Ɲ.Ƈꭚ⍴𝔥єɼ👩🏻💻<p>Okay, back-of-the-napkin math:<br> - There are probably 100 million sites and 1.5 billion pages worth indexing in a <a href="https://lgbtqia.space/tags/search" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>search</span></a> engine<br> - It takes about 1TB to <a href="https://lgbtqia.space/tags/index" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>index</span></a> 30 million pages.<br> - We only care about text on a page.</p><p>I define a page as worth indexing if:<br> - It is not a FAANG site<br> - It has at least one referrer (no DD Web)<br> - It's active</p><p>So, this means we need 40TB of fast data to make a good index for the internet. That's not "runs locally" sized, but it is nonprofit sized.</p><p>My size assumption is that each page's index entry stores:<br> - <a href="https://lgbtqia.space/tags/URL" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>URL</span></a><br> - <a href="https://lgbtqia.space/tags/TFIDF" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TFIDF</span></a> information<br> - Text <a href="https://lgbtqia.space/tags/Embeddings" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Embeddings</span></a><br> - Snippet </p><p>We can store the index entry for one page in about 30KB (1TB spread over 30 million pages). So, in 40TB we can store a full internet index. That's about $500 in storage.</p><p>Access time becomes a problem. TFIDF for the whole internet can easily fit in RAM. Even with <a href="https://lgbtqia.space/tags/quantized" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>quantized</span></a> embeddings, you can only fit 2 million per GB in RAM. 
</p><p>Assuming you had enough RAM it could be fast: TF-IDF to get 100 million candidates, <a href="https://lgbtqia.space/tags/FAISS" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>FAISS</span></a> to sort those, load snippets dynamically, potentially modify rank by referrers etc.</p><p>Six 128GB <a href="https://lgbtqia.space/tags/Framework" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Framework</span></a> <a href="https://lgbtqia.space/tags/desktops" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>desktops</span></a>, each with 5TB HDs (plus one Raspberry Pi to sort the final candidates from the six machines), is enough to replace <a href="https://lgbtqia.space/tags/Google" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>Google</span></a>. That's about $15k. </p><p>In two to three years this will be doable on a single machine for around $3k.</p><p>By the end of the decade it should be runnable as an app on a powerful desktop.</p><p>Three years after that it can run on a <a href="https://lgbtqia.space/tags/laptop" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>laptop</span></a>.</p><p>Three years after that it can run on a <a href="https://lgbtqia.space/tags/cellphone" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>cellphone</span></a>.</p><p>By #2040 it's a background process on your cellphone.</p>
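<p>For anyone who wants to poke at the napkin math, here is a rough sketch of the arithmetic in Python. It only uses the figures from the post, plus two assumptions of mine: decimal units (1TB = 10^12 bytes), and that the six machines have 128GB of RAM each. Taken at face value the inputs give roughly 50TB for the index, the same ballpark as the 40TB quoted above, and about 750GB of RAM for the embeddings, which six 128GB machines just cover.</p>

```python
# Back-of-the-napkin sizing, using only the figures quoted in the post.
PAGES = 1_500_000_000          # pages worth indexing
PAGES_PER_TB = 30_000_000      # pages that fit in 1 TB of index
EMBEDDINGS_PER_GB = 2_000_000  # quantized embeddings per GB of RAM
MACHINES = 6
RAM_PER_MACHINE_GB = 128       # assumption: 128 GB of RAM per machine

TB_BYTES = 10**12              # assumption: decimal terabytes
GB_BYTES = 10**9               # assumption: decimal gigabytes

# Storage: how big is one page's index entry, and the whole index?
bytes_per_page = TB_BYTES / PAGES_PER_TB   # about 33 KB per page
index_tb = PAGES / PAGES_PER_TB            # total index size in TB

# RAM: how much memory do the embeddings alone need?
bytes_per_embedding = GB_BYTES / EMBEDDINGS_PER_GB
embedding_ram_gb = PAGES / EMBEDDINGS_PER_GB
cluster_ram_gb = MACHINES * RAM_PER_MACHINE_GB

print(f"index entry per page: {bytes_per_page / 1000:.1f} KB")
print(f"full index:           {index_tb:.0f} TB")
print(f"bytes per embedding:  {bytes_per_embedding:.0f}")
print(f"embedding RAM needed: {embedding_ram_gb:.0f} GB")
print(f"cluster RAM:          {cluster_ram_gb} GB")
print(f"embeddings fit in RAM: {cluster_ram_gb >= embedding_ram_gb}")
```

<p>Change any of the constants at the top (page count, embeddings per GB, RAM per machine) to see how quickly the machine count moves.</p>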