Forum: War Ensemble BBS

Search 101: How to Find the Lost Web

From Ben Collver@bencollver@tilde.pink to comp.misc on Tue Jul 8 14:11:46 2025

From Newsgroup: comp.misc

Search 101: How to Find the Lost Web by Bob Leggitt
===================================================

Google's lost the internet. You might have seen a few complaints.
Whether they've come courtesy of anons in the underbelly of the
Fediverse, or a viral soundbite from Edward Snowden, a growing
catalogue of gripes is asserting that web search is no longer fit for
purpose. Well, unless web search's purpose is to detect capitalism.
In which case thumbs up. The search engines are better than ever at
that. They now surface ecommerce, ad-tech, and affiliate-pumped
listicle hell so reliably that we barely even need to enter a search
term.

<https://twitter.com/Snowden/status/1460666075033575425>

But the internet we used to know and love, brimming with offbeat gems
from passionate authors... That's gone missing. And with it, the
humour. The imagination. The individuality... Maybe we've just
forgotten how to use a search engine?...

Nope, it's definitely not us. Change the nuance of the query. Add a
tail. Use quotes... It doesn't seem to matter anymore. We get the
same list of crap based on one or two commercially-associable
keywords, more or less whatever else we type. What we won't get, is
what we actually searched for. If we're looking for a specific piece
of information, the average web search engine is going to ignore the
specifics and hammer us with a scripted smorgasbord of abject
capitalism, augmented with a token entry from Wikipedia--frequently
the only useful result.

But the thing is, Wikipedia has its own search engine, which is
mercifully devoid of results from Amazon, eBay, and an army of
affiliate-drones' listicle sites. So, if Wikipedia is often the only
genuinely useful and non-commercial resource we're finding in the
visible output of a web search engine, why are we still running to
the likes of Google, Bing and DDG as a first resort? Why do we not
just use Wikipedia Search as a basic source of general knowledge
results, and supplement that with a range of other search facilities
which are at least attempting to give us what we ask for?

<https://wikipedia.org/wiki/Special:Search>

<https://popzazzle.blogspot.com/2021/07/duckduckgo-gets-blocked-by-privacy-protection-routine.html>

That's exactly what this article is going to suggest.

DO WE REALLY NEED TO REPLACE WEB SEARCH?
========================================

It's no coincidence that many of the complaints about search quality
are coming from privacy activists. Privacy activists have a
heightened sense of the way censorship works hand in hand with
surveillance to build the classic picture of Nineteen Eighty-Four.
And when we know a search engine is capable of giving us accurate,
relevant results, but doesn't, we realise *we're seeing a form of
censorship*.

Search engines have nannied us for a long time, assuming by default
that we mistyped any query that isn't verified as a popular topic.
But we're beyond that now. We're no longer in the realm of
"Are you sure you want that?". We've descended into...

No, you don't want that.

"Yes I do."

No you don't.

"Yes I do."

No you don't.

Even if we still regard this as heavy nannying rather than
censorship, do we really want to go through that ever-lengthening
argument every time we run a web search? Just for the sake of our
"framework of mind", as Dickie Valentino put it in the cult 1994
movie There's No Business..., we have to start finding better ways to
access information. It's unlikely we'll break free from major web
search engines entirely, but now is definitely the time to start
reducing our dependency on them.

Once upon a time you could search Google for the world's most
useless product and dive into a motley collection of chucklesome
mock ads--topped, if I remember rightly, by an enthusiastic
promotion for... Well, is it a screw? Is it a nail? No - it's a
scrail! Still makes me laugh today. But that same search produces
wall to wall e-corp listicles now. And if you search for scrails
you get Amazon trying, in all seriousness, to flog you an actual
packet of scrails. It's like, is there any Google search at all you
can now run that doesn't eventually lead to Amazon?

FOR ACCURATE INFO, TWITTER SEARCH IS NOW MORE USEFUL THAN GOOGLE
================================================================

It's almost incomprehensible that the worldbeating sophistication of
Google Search could regress so far as to allow a micro-blogging site
to provide more relevant information, but that's where we are. And
one of the main reasons Twitter Search has become more popular than
Google with many people who research for a living, is the platform's
rigid protection of chronological integrity.

There are three components to this...

* All Tweets are dated and uneditable.

* Twitter Search allows us to define a date range.

* One of the best ways to find a relevant search result is to filter
out spam, and spam tends to come in waves, which are based on
trends and current affairs. In other words, a reliable date filter
can serve as a reliable spam filter.

When something becomes a talking point, the search results are
overwhelmed by spammers, news sites, megablogs, etc, jumping on that
talking point in a bid for search traffic. They know everyone is
looking for info on that subject, so they produce content about it
whether or not they have anything to say. This vast glut of very high
ranking domains then squeezes out all of the previous results, and
that usually makes finding previously published information through
typical web search methods incredibly difficult--if not impossible.
Some of these assaults of skimpily-researched verbal diarrhoea
actually end up changing history, as the public accept bone idle
journalism as truth and the reality is buried out of sight.

But Twitter's Advanced Search allows us to cut through the spam by
defining a date range in a search query. Because no one can
manipulate the dates of Tweets, or any of the information contained
within them, filtering out the period of the spam assault can
completely remove the spam.

<https://twitter.com/search-advanced>

For example, if you want to know what the consensus on vaccination
was in summer 2019 before the covid pandemic, a web search engine
will overwhelm you with spam about the covid vax. But if you search
on Twitter, limiting the date range to summer 2019, you will get
precisely the consensus that existed at the time, and nothing else.
There's no contamination, because no one can fake a summer 2019
Tweet. If the Tweet is dated June, July or August 2019, then that's
when it was published, and its content is what it contained back
then. This is very different from the output of web search engines,
where random third parties control the information sources, and you
have all manner of people manipulating both post content and post
dates in a bid to win traffic.
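
For anyone who'd rather type the date range than click through the
Advanced Search form, here is a minimal sketch in Python of how such
a query could be put together. It assumes Twitter's search box
accepts the since: and until: operators with YYYY-MM-DD dates, and
that the f=live URL parameter selects the chronological tab. Both are
assumptions about how the site currently behaves, and the function
name is made up purely for illustration.

```python
from datetime import date
from urllib.parse import urlencode


def dated_search_url(terms: str, start: date, end: date) -> str:
    """Build a Twitter search URL limited to a date range.

    Assumption: since:/until: take ISO dates, with until: treated
    as the day the range stops.
    """
    query = f"{terms} since:{start.isoformat()} until:{end.isoformat()}"
    # f=live is assumed to request the chronological results tab.
    return "https://twitter.com/search?" + urlencode({"q": query, "f": "live"})


# Summer 2019: June, July and August.
print(dated_search_url("vaccination", date(2019, 6, 1), date(2019, 9, 1)))
```

The query text itself (everything in the q parameter) can equally be
typed straight into the search box.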

BUSTIN' MYTHS
=============

The value of being able to filter spam chronologically is immense,
and it can completely demolish virtually any myth built by the
information machine. In the 2010s I got curious as to the origins of
the Twitter hashtag. I wanted to know who invented the idea.
Wikipedia and a clutch of other sites assured me it was Chris
Messina. How predictable, I thought. Twitter hashtag invented by a
high-profile, privileged dude with connections galore. But my life
experience told me that *well-connected dudes with high public
profiles are much better associated with taking credit for inventions
than actually inventing them*. So I decided to check out the story on
Twitter itself.

Because of Twitter's chronological integrity and the fact that I
could restrict the period of investigation to a time before Messina's
claim, I was able to establish that Messina did not in fact invent
the Twitter hashtag. I wrote a post documenting the truth back in
2016. Sadly, it's been one of the least visited posts I've ever
written. The search engines are quite happy with wall to wall
regurgitations of the Wikipedia line. But the post does demonstrate
how much more accurate Twitter can be as an information source than a
typical web search engine. And whilst single Tweets are limited (by
character-count) in their ability to elaborate on a story,
collectively they can prove extremely thorough in the picture they
provide.

<https://twirpz.tumblr.com/post/676456221578035200/twitters-real-hashtag-pioneers>

Twitter also affords us a directional filter on information. By
default, we only really see what influential voices are saying. But
we can filter a Twitter Advanced Search to show only the replies TO
those influential voices. That directional filter can serve as an
ideological filter and quickly take us to the opposing views which a
web search engine can easily hide.

This works brilliantly where marketing or propaganda is strong. For
example, a brand is only ever going to tell you what it gets right.
Never what it gets wrong. The brand will typically use SEO strategies
with web search engines, to ensure that its official messaging
occupies the whole front page, and that the more negative feedback is
buried under a continuous spew of marketing. But using Twitter Search
we can completely filter out the brand's own messaging and search
only the replies to it. This gives a much truer picture of the
brand's performance, and we additionally get to see whether the brand
addresses issues raised by members of the public, or simply ignores
them.
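
The replies-only filter can be sketched the same way. This Python
snippet assumes the to: operator matches Tweets addressed to an
account and that -from: excludes the account's own Tweets. The handle
ExampleBrand is hypothetical, as is the function name.

```python
from urllib.parse import urlencode


def replies_to_url(handle: str, terms: str = "") -> str:
    """Build a search for replies TO an account, minus its own Tweets."""
    handle = handle.lstrip("@")
    # Assumed operators: to: (addressed to) and -from: (not written by).
    parts = [f"to:{handle}", f"-from:{handle}"]
    if terms:
        parts.append(terms)
    query = " ".join(parts)
    return "https://twitter.com/search?" + urlencode({"q": query, "f": "live"})


# What are people telling the (hypothetical) brand about refunds?
print(replies_to_url("@ExampleBrand", "refund"))
```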

<https://popzazzle.blogspot.com/2022/01/why-fact-checkers-need-to-fact-right-off.html>

<https://popzazzle.blogspot.com/2020/09/content-marketing-on-low-budget.html>

CUSTOMISED SEARCH
=================

Instances of the decentralised search engine Searx (listed here--page
requires JavaScript) are often recommended as an alternative to
bigger web search engines. But it's rarely explained how the search
capabilities offered by Searx can be rigorously customised to focus
on the best sources of information for a given subject.

<https://searx.space/>

Searx is all about metasearch. That is, compiling results from a
variety of different search indexes. But with Searx, you can choose
which indexes you want to query. If you've explored and tested
various instances of Searx, you've probably noticed that the search
results can be vastly different from one instance to the next. That's
because each one is set up by its administrator to query a different
selection of indexes. But the range of sources a Searx instance
queries is also open to user-customisation. By going into the
*Preferences*, you can define exactly whose results you want, and
whose you don't.

I'll use Searx Belgium as an example, because I've found it to be
reliable. There are tabs along the top of the results page that
denote categories of search. Once you've entered a search term and
have a results list on screen, you'll see that the results list is
headed with horizontal selection options such as General, Images,
Videos, News, etc. Unlike with Google, you can simultaneously choose
as many or as few of these search categories as you like. Just select
the tab or tabs you want and then re-click the Start Search button.

<https://searx.be/>

Let's say you de-selected the *General* tab--which is selected by
default--and instead selected the *Social Media* tab. You'll see a
dramatic change in the results. Rather than being sourced from
Google, Wikipedia, etc (which are Searx Belgium's default sources for
General search), the results are now solely coming from Reddit (which
is Searx Belgium's default source for Social Media).

I really like having the option to get a selection of results solely
from Reddit, because community Q&A discussion is broadly a lot more
genuine than the output of some listicle merchant whose real goal is
not to help you solve a problem, but to pocket some commission from
Amazon. Even if the contributors on Reddit are not experts (and
sometimes they are), collectively they're likely to get you closer to
a real solution than an expert blogger who isn't even trying to help.

True, we could confine Google or DuckDuckGo search results to Reddit
by prefixing our search term with *site:reddit.com*--and this is one
of the only really reliable techniques left of filtering out the
annoying spam on major web search engines. But we've come to expect
greater convenience than having to type a website domain into a
search box, and that's what the tab system on Searx gives us.
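
If typing the prefix every time is the sticking point, it can be
wrapped in a couple of lines of Python. This is only a sketch: the
function names are invented for illustration, and the DuckDuckGo URL
shape (a q parameter on the site root) is an assumption about how
that engine accepts queries.

```python
from urllib.parse import urlencode


def site_query(domain: str, terms: str) -> str:
    """Prefix a search term with a site: restriction."""
    return f"site:{domain} {terms}"


def duckduckgo_url(query: str) -> str:
    """Turn a query string into a DuckDuckGo search URL (assumed format)."""
    return "https://duckduckgo.com/?" + urlencode({"q": query})


query = site_query("reddit.com", "linux mint wifi drops")
print(query)
print(duckduckgo_url(query))
```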

Out of the box, the Searx instance in our example already offers some
easy ways to customise the search results for specific needs. But by
pitching into the *Preferences*, we can further tailor the sources
for each of those category tabs. For example, we could restrict the
image search sources solely to Unsplash, or Flickr. Then we filter
out all of the news site spam and very predominantly find photography
enthusiasts instead.
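
The same per-search choice of categories and engines can also be
expressed in the search URL rather than through saved preferences.
The Python sketch below assumes the instance accepts the q,
categories and engines parameters on its /search path, and that the
engines are named unsplash and flickr on that instance. The base URL
is a hypothetical self-hosted instance, and individual instances may
disable or rename engines.

```python
from urllib.parse import urlencode


def searx_url(base: str, query: str, categories=(), engines=()) -> str:
    """Build a Searx search URL limited to chosen categories/engines."""
    params = {"q": query}
    if categories:
        params["categories"] = ",".join(categories)
    if engines:
        # Engine names are instance-specific; these are assumptions.
        params["engines"] = ",".join(engines)
    return base.rstrip("/") + "/search?" + urlencode(params)


# Hypothetical local instance; swap in one you actually use.
print(searx_url(
    "http://localhost:8888",
    "harbour at dusk",
    categories=["images"],
    engines=["unsplash", "flickr"],
))
```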

Independence from major web search is something we can, and should,
try to build progressively.

Incidentally, if you do make any changes in Searx *Preferences*,
don't forget to scroll down and hit the *Save* button at the bottom
of the page. Otherwise your changes won't register. You'll also need
to have cookies enabled for the browser to remember your prefs.

The next step up from here is spinning up your own Searx instance.
This requires the use of a server (although it's included in the
pre-packaged installation options if you use FreedomBox). It does,
however, afford you an even more detailed realm of customisation. Not
everyone will go this far, but the option is there for those who want
to take it to another level.

<https://www.freedombox.org/>

One of the other great benefits of the Searx *Preferences* page,
beyond simply changing the searchable indexes, is that it illustrates
just how many different search resources there are, and names them so
we can investigate them in their own right. For instance, you might
spot Wiby among the General search options. What's Wiby?...

UNDERGROUND SEARCH ENGINES
==========================

Wiby sits alongside Marginalia, representing a budding breed of search
engines that shun the modern internet and focus on the simpler
and imaginative web of yesteryear. A time of enthusiasm, as opposed
to pathological obsession with revenue. These obscure search engines
are incredibly refreshing to use, because they deliberately punish
the exact cash-crazed ideology that Google goes out of its way to
reward. Within moments, the offbeat output from these underground
resources illustrates just how tiresomely predictable Google Search
and its derivatives have become.

<https://wiby.me/>

<https://search.marginalia.nu/>

That small operators can build these products with limited indexes,
and serve results which wake us up in a way that the mighty,
multi-$billion Google has long since ceased doing, attests to a stark
reality... Google no longer wants to stimulate us mentally. It just
wants to haul us into a commercial brainwashing system and fire off
its bullshit-ass lab-ratting schemes in every last corner of our
itinerary.

If you've recently tried to use a major web search engine to find
original, detailed, historical web analysis published in the 1990s or
early 2000s, you'll know how deeply frustrating it can be to solidly
encounter 500-word SEO spins that some half-assed journalist wrote on
a news site in 2020 or 2021. This is where underground engines like
Marginalia and Wiby really come into their own. If you want to know
what people were writing about Windows 98, *in 1998*, the best chance
you have of achieving that with a minimum of hacks and advanced
workarounds, is with a search engine like Marginalia or Wiby.

USING THE WIKIPEDIA CITATIONS LISTS AS SEARCH RESULTS
=====================================================

Another clever way to access high quality, vetted resources with zero
spam, and zero advertising, is to employ a two-step process in which
the Wikipedia citations lists serve as sets of search results. Search
your query on Wikipedia, click through to the relevant page, then
scroll straight to the bottom and review the References, Sources or
External Links sections. Whilst a lot of entries will be hard copy
books or links to other pages on Wikipedia itself, there are usually
some links to definitive posts on other websites.
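
The scroll-to-the-bottom step can be roughly automated. The Python
sketch below assumes the MediaWiki API on English Wikipedia returns a
page's outbound links via action=parse with prop=externallinks, under
parse -> externallinks in the JSON. It builds the request URL and
filters a response, but makes no network call itself. The sample
response is invented for illustration, and the function names are
hypothetical.

```python
from urllib.parse import urlencode, urlparse

API = "https://en.wikipedia.org/w/api.php"


def external_links_url(page_title: str) -> str:
    """URL asking the MediaWiki API for a page's external links."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "externallinks",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def outside_sources(api_response: dict) -> list[str]:
    """Keep only links that leave Wikipedia/Wikimedia."""
    links = api_response.get("parse", {}).get("externallinks", [])
    kept = []
    for link in links:
        host = urlparse(link).hostname or ""
        if not host.endswith(("wikipedia.org", "wikimedia.org")):
            kept.append(link)
    return kept


# Invented sample in the assumed response shape.
sample = {"parse": {"title": "Example", "externallinks": [
    "https://example.org/history.html",
    "https://commons.wikimedia.org/wiki/Category:Example",
]}}
print(external_links_url("Windows 98"))
print(outside_sources(sample))
```

Fetching the printed URL and passing the decoded JSON to
outside_sources() would give a plain list of source sites to visit.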

Wikipedia operates in a parasitic manner, taking information from
everywhere, whilst using "nofollow" link attributes to low-key strip
its sources of validity in the eyes of Google. So what tends to
happen is that the Wiki rises higher and higher in the search
results, while the visibility of the sources steadily declines. This
means that many of the sources cited in the Wikipedia References
lists, even though extremely high quality, are no longer prominent on
major web search engines. A perfect illustration of the problems with
search engines, as well as the Machiavellian behaviour of Big
Tech--of which Wikimedia is a component.

You can create your own small search engine just by scaling up your
bookmark collection. I would recommend this to anyone.

<https://backlit.neocities.org/is-the-downfall-of-cloudflare-nigh.html>

But we can use the Reference lists themselves as valuable pointers to
quality sources, which lead us to real experts who can give us deeper
insight. In general knowledge fields, this method can be a lot more
productive (and certainly more reliable) than merely querying a major
web search engine. In using this method and visiting original, high
quality source sites, you're also helping to reward the people who
have been screwed over by Wikimedia and Google.

COLLECTIVE STRENGTH
===================

So, our search bookmarks now combine Wikipedia Search, Twitter
Advanced Search, a customised Searx instance or two, and the
retro-focused Wiby and/or Marginalia. This collective base gives us
better access to truthful, useful and insightful information than
we'll get from hopefully banging queries into Google, Bing or
DuckDuckGo. And importantly it also helps free us from much of the
timewasting irrelevance that formerly dominated our search results.
Simply, we see the word "Amazon" a heck of a lot less, and if you're
anything like me, that's a lifestyle-improvement in itself.

There will still be times when we need a more mainstream engine, but
the mainstream engines are now too overwhelmed with what Google used
to rank down as "webspam", to serve as a first resort. Now that
Google positively loves and encourages "webspam", and instead spends
its time ranking down sites that don't give friggin' Tag Manager
enough gainful employment, it's no longer about the consumer. It's
50% an elitist closed shop in which Amazon, eBay, YouTube and Co. win
by default, and 50% a "which established e-corp can bribe the most
PR7s and pump the most elaborate data graph into Silicon Valley?"
contest.

It's hard to stop relying on major search engines, because the one
advantage they still do have is convenience. Being able to search
everything from one place becomes a habit, and it's a hard habit to
break. But we have to move on from a reliance on search engines like
Google before things get even worse.

BUILDING A BANK OF NICHE RESOURCES
==================================

Independence from major web search can also be built progressively.
If we make a point of looking for search facilities on sites that do
provide us with good value, and then use those searches directly in
future, we steadily reduce our reliance on the ad machine.

For most people, web search engines don't really serve that many
specific purposes. So whilst we might imagine having to build a very
long list of niche resources in order to replace something like
Google, in fact, a relatively limited number of entries will cover
most of the ground.

As a writer, one of my common queries is a synonym search. I became
aware that I was searching for synonyms a lot, and that the sites I
ended up visiting often gave poor matches, or had a grim
user-experience. So the next time I found a site that gave me a good
user experience and useful synonyms, I bookmarked the
site--WordHippo--and used their internal search instead of constantly
searching the whole web. Much quicker, no wading through ecommerce
entries, and it's done the job.

BOOKMARKING - A MEASURE OF THE FAILED SEARCH
============================================

Through the twenty-tens I realised I was bookmarking more and more
URLs, and I can see today that I do it obsessively. We've reached a
point where we simply can't trust a search engine to find a useful
post again next week, so anything at all that we have serious
intentions of revisiting, we realistically need to bookmark.

But well-categorised bookmarking is another compound escape route
from major web search engines. Often, we know exactly which site we
want to go to, but we don't remember the URL or the precise domain
name. Is it .com or .org? .net or .me? Or .co.uk?... Many of us just
tap the site name into a search engine and hit the link in the
results. And even then we don't necessarily make a mental note of the
domain name. We just keep repeating the same behaviour. Running that
site name through a search engine every time we want to visit. This
might happen twenty times, forty times, or more. So just by adding
one bookmark to a browser, we might save ourselves scores of web
searches. It all adds up.
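
A well-categorised bookmark list is also trivially searchable in its
own right. As a minimal sketch in Python, assuming bookmarks are kept
as a title, a URL and a few tags, a keyword lookup needs only a few
lines. The data structure, the tags and the find() helper are all
made up for illustration; the sites are ones mentioned in this
article.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    title: str
    url: str
    tags: set[str] = field(default_factory=set)


def find(bookmarks: list[Bookmark], query: str) -> list[Bookmark]:
    """Return bookmarks whose title, URL or tags contain every term."""
    terms = query.lower().split()
    hits = []
    for mark in bookmarks:
        haystack = " ".join([mark.title, mark.url, *sorted(mark.tags)]).lower()
        if all(term in haystack for term in terms):
            hits.append(mark)
    return hits


BOOKMARKS = [
    Bookmark("WordHippo", "https://www.wordhippo.com/", {"synonyms", "writing"}),
    Bookmark("Wiby", "https://wiby.me/", {"search", "retro"}),
    Bookmark("Linux Mint", "https://linuxmint.com/", {"linux", "os"}),
]

for hit in find(BOOKMARKS, "synonyms"):
    print(hit.title, hit.url)
```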

The general rise in bookmarking is a measure of how little confidence
we now have in web search. If you run a website, you're probably
seeing a lot more visits from bookmarking resources than you did even
just two or three years back. But I was struck a couple of weeks back
by the lengths to which I was prepared to go in order to use a
bookmark rather than a web search...

Recently I switched my day-to-day operating system setup from Bodhi
Linux with a dual-booting Windows partition, to a standalone Linux
Mint installation. I left the Bodhi/Windows hard drive in the PC, but
disconnected it, then fitted a new disk, and installed Mint. Wicked.
Perfectly happy with Mint - no desire to get back into Windows...
Until I wanted to find an answer to a tech query that I'd seen on
Reddit.

<https://www.bodhilinux.com/>

<https://linuxmint.com/>

I tried two or three web searches and could see I was getting
nowhere. I could have pitched into the usual cat and mouse game of
trial and error, using strategic quotes, increasingly long tails,
etc. But the thought of all that was actually so gruelling that I
instead switched off the computer, unscrewed the casing, connected up
the other disk, rebooted, and went into Windows, where I knew I'd
saved the bookmark. We think of search engines as a convenience, but
if dismantling a computer is more convenient than completing what
should be a straightforward search process, we really do need to
think again.

SEARCH CRISIS
=============

The creeping perception that web search is not serving our needs is
as much a crisis for search engines as it is for us. To date, it
hasn't dented the top search engines' profitability, because Google
just blasts more and more Big Tech real estate into the results,
making the elitist cartel more and more money per query. But there's
only so far that can go. If it reaches the point where the public can
reliably predict which sites they're going to find in the search
results, there is no longer any point in them using a web search
engine.

We're already well down that road, and for Google, there's no turning
back. If Google boots its own, its partners', its lobbying pals'
(raise your hand Wikimedia), and its corporate supplicants' domains
out of the results to allow the wider web back into the picture and
restore public faith, its profits are going to bomb. Google now
relies on corrupt search algorithms to hit its financial targets, so
the only question is how much of that increasingly unsightly road
there is left to travel before people begin jumping off the cart in
volume.

And at present? Well, it isn't that people don't realise how bad the
search results now are. They absolutely do. In research on Twitter, I
found a broad recognition that today's search results are worse than
they used to be. It's just that people blame publishers, and not the
search engines, for the decline.

But the thing is, the pages of yesteryear have not gone anywhere.
Wonderfully entertaining sites I became aware of in the late 1990s
are still up, still rigorously maintained and updated with high
quality writing. But they never show up today in the search results.
If I hadn't found out about them years ago, I wouldn't know they
existed. And if you Google the names of their admins - the
internet-famous of the AltaVista era - the results are topped not by
their excellent sites, but by Facebook and LinkedIn accounts. Some
belonging to randoms with the same name. So this is not a decline in
publishing standards. This is a conspiracy. And conspirators rarely
fool the public forever.

From: <https://backlit.neocities.org/how-to-find-the-lost-web>
--- Synchronet 3.21a-Linux NewsLink 1.2

From jerk-o@jerk_o2002@yahoo.com to comp.misc on Tue Jul 8 13:18:05 2025

From Newsgroup: comp.misc

On Tue, 8 Jul 2025 14:11:46 -0000 (UTC), Ben Collver
<bencollver@tilde.pink> wrote

UNDERGROUND SEARCH ENGINES
==========================

Wiby sits aside Marginalia, representing a budding breed of search
engines that shun the modern internet and focus on the more simple
and imaginative web of yesteryear. A time of enthusiasm, as opposed
to pathological obsession with revenue. These obscure search engines
are incredibly refreshing to use, because they deliberately punish
the exact cash-crazed ideology that Google goes out of its way to
reward. Within moments, the offbeat output from these underground
resources illustrates just how tiresomely predictable Google Search
and its derivatives have become.

<https://wiby.me/>

<https://search.marginalia.nu/>

Another one would be <https://yacy.net/>
--- Synchronet 3.21a-Linux NewsLink 1.2

From candycanearter07@candycanearter07@candycanearter07.nomail.afraid to comp.misc on Wed Jul 9 19:20:03 2025

From Newsgroup: comp.misc

Ben Collver <bencollver@tilde.pink> wrote at 14:11 this Tuesday (GMT):
[snip]

FOR ACCURATE INFO, TWITTER SEARCH IS NOW MORE USEFUL THAN GOOGLE
================================================================

It's almost incomprehensible that the worldbeating sophistication of
Google Search could regress so far as to allow a micro-blogging site
to provide more relevant information, but that's where we are. And
one of the main reasons Twitter Search has become more popular than
Google with many people who research for a living, is the platform's
rigid protection of chronological integrity.

There are three components to this...

* All Tweets are dated and uneditable.

* Twitter Search allows us to define a date range.

* One of the best ways to find a relevant search result is to filter
out spam, and spam tends to come in waves, which are based on
trends and current affairs. In other words, a reliable date filter
can serve as a reliable spam filter.

[snip]

Twitter has its own problems, but I'll keep that in mind. Does anyone
know if it's possible to search on Twitter without an account?

From: <https://backlit.neocities.org/how-to-find-the-lost-web>

--
user <candycane> is generated from /dev/urandom
--- Synchronet 3.21a-Linux NewsLink 1.2
