• Search 101: How to Find the Lost Web

    From Ben Collver@bencollver@tilde.pink to comp.misc on Tue Jul 8 14:11:46 2025
    From Newsgroup: comp.misc

    Search 101: How to Find the Lost Web by Bob Leggitt
    ===================================================

    Google's lost the internet. You might have seen a few complaints.
    Whether they've come courtesy of anons in the underbelly of the
    Fediverse, or a viral soundbyte from Edward Snowden, a growing
    catalogue of gripes is asserting that web search is no longer fit for
    purpose. Well, unless web search's purpose is to detect capitalism.
    In which case thumbs up. The search engines are better than ever at
    that. They now surface ecommerce, ad-tech, and affiliate-pumped
    listicle hell so reliably that we barely even need to enter a search
    term.

    <https://twitter.com/Snowden/status/1460666075033575425>

    But the internet we used to know and love, brimming with offbeat gems
    from passionate authors... That's gone missing. And with it, the
    humour. The imagination. The individuality... Maybe we've just
    forgotten how to use a search engine?...

    Nope, it's definitely not us. Change the nuance of the query. Add a
    tail. Use quotes... It doesn't seem to matter anymore. We get the
    same list of crap based on one or two commercially-associable
    keywords, more or less whatever else we type. What we won't get, is
    what we actually searched for. If we're looking for a specific piece
    of information, the average web search engine is going to ignore the
    specifics and hammer us with a scripted smorgasbord of abject
    capitalism, augmented with a token entry from Wikipedia--frequently
    the only useful result.

    But the thing is, Wikipedia has its own search engine, which is
    mercifully devoid of results from Amazon, eBay, and an army of
    affiliate-drones' listicle sites. So, if Wikipedia is often the only
    genuinely useful and non-commercial resource we're finding in the
    visible output of a web search engine, why are we still running to
    the likes of Google, Bing and DDG as a first resort? Why do we not
    just use Wikipedia Search as a basic source of general knowledge
    results, and supplement that with a range of other search facilities
    which are at least attempting to give us what we ask for?

    <https://wikipedia.org/wiki/Special:Search>

    <https://popzazzle.blogspot.com/2021/07/
    duckduckgo-gets-blocked-by-privacy-protection-routine.html>

    That's exactly what this article is going to suggest.

    DO WE REALLY NEED TO REPLACE WEB SEARCH?
    ========================================

    It's no coincidence that many of the complaints about search quality
    are coming from privacy activists. Privacy activists have a
    heightened sense of the way censorship works hand in hand with
    surveillance to build the classic picture of Nineteen Eighty-Four.
    And when we know a search engine is capable of giving us accurate,
    relevant results, but doesn't, we realise *we're seeing a form of
    censorship*.

    Search engines have nannied us for a long time, assuming by default
    that we mistyped any query that isn't verified as a popular topic.
    But we're beyond that now. We're no longer in the realm of
    "Are you sure you want that?". We've descended into...

    No, you don't want that.

    "Yes I do."

    No you don't.

    "Yes I do."

    No you don't.

    Even if we still regard this as heavy nannying rather than
    censorship, do we really want to go through that ever-lengthening
    argument every time we run a web search? Just for the sake of our
    "framework of mind", as Dickie Valentino put it in the cult 1994
    movie There's No Business..., we have to start finding better ways to
    access information. It's unlikely we'll break free from major web
    search engines entirely, but now is definitely the time to start
    reducing our dependency on them.

    Once upon a time you could search Google for the world's most
    useless product and dive into a motley collection of chucklesome
    mock ads--topped, if I remember rightly, by an enthusiastic
    promotion for... Well, is it a screw? Is it a nail? No - it's a
    scrail! Still makes me laugh today. But that same search produces
    wall to wall e-corp listicles now. And if you search for scrails
    you get Amazon trying, in all seriousness, to flog you an actual
    packet of scrails. It's like, is there any Google search at all you
    can now run that doesn't eventually lead to Amazon?

    FOR ACCURATE INFO, TWITTER SEARCH IS NOW MORE USEFUL THAN GOOGLE
    ================================================================

    It's almost incomprehensible that the worldbeating sophistication of
    Google Search could regress so far as to allow a micro-blogging site
    to provide more relevant information, but that's where we are. And
    one of the main reasons Twitter Search has become more popular than
    Google with many people who research for a living, is the platform's
    rigid protection of chronological integrity.

    There are three components to this...

    * All Tweets are dated and uneditable.

    * Twitter Search allows us to define a date range.

    * One of the best ways to find a relevant search result is to filter
    out spam, and spam tends to come in waves, which are based on
    trends and current affairs. In other words, a reliable date filter
    can serve as a reliable spam filter.

    When something becomes a talking point, the search results are
    overwhelmed by spammers, news sites, megablogs, etc, jumping on that
    talking point in a bid for search traffic. They know everyone is
    looking for info on that subject, so they produce content about it
    whether or not they have anything to say. This vast glut of very high
    ranking domains then squeezes out all of the previous results, and
    that usually makes finding previously published information through
    typical web search methods incredibly difficult--if not impossible.
    Some of these assaults of skimpily-researched verbal diarrhoea
    actually end up changing history, as the public accept bone idle
    journalism as truth and the reality is buried out of sight.

    But Twitter's Advanced Search allows us to cut through the spam by
    defining a date range in a search query. Because no one can
    manipulate the dates of Tweets, or any of the information contained
    within them, filtering out the period of the spam assault can
    completely remove the spam.

    <https://twitter.com/search-advanced>

    For example, if you want to know what the consensus on vaccination
    was in summer 2019 before the covid pandemic, a web search engine
    will overwhelm you with spam about the covid vax. But if you search
    on Twitter, limiting the date range to summer 2019, you will get
    precisely the consensus that existed at the time, and nothing else.
    There's no contamination, because no one can fake a summer 2019
    Tweet. If the Tweet is dated June, July or August 2019, then that's
    when it was published, and its content is what it contained back
    then. This is very different from the output of web search engines,
    where random third parties control the information sources, and you
    have all manner of people manipulating both post content and post
    dates in a bid to win traffic.
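
    To make the date-range idea concrete, here is a minimal Python
    sketch that builds such a search URL. It assumes Twitter's `since:`
    and `until:` search operators and the `f=live` ("Latest" tab) URL
    parameter still behave as described in the comments; the function
    name is made up for illustration.

```python
from datetime import date
from urllib.parse import urlencode


def dated_search_url(terms: str, since: date, until: date) -> str:
    """Build a Twitter search URL restricted to a date range."""
    # `since:` and `until:` are Twitter search operators. `until:` is
    # treated here as exclusive (an assumption worth checking), so pass
    # the day AFTER the last day you want included.
    query = f"{terms} since:{since.isoformat()} until:{until.isoformat()}"
    # `f=live` requests the chronological "Latest" tab. This is an
    # assumption about the site's URL scheme, which may change.
    return "https://twitter.com/search?" + urlencode({"q": query, "f": "live"})


# Summer 2019: 1 June up to (but not including) 1 September.
summer_2019 = dated_search_url("vaccination", date(2019, 6, 1), date(2019, 9, 1))
print(summer_2019)
```

    The same query string can be typed straight into the Twitter
    search box; the script only saves retyping the dates.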

    BUSTIN' MYTHS
    =============

    The value of being able to filter spam chronologically is immense,
    and it can completely demolish virtually any myth built by the
    information machine. In the 2010s I got curious as to the origins of
    the Twitter hashtag. I wanted to know who invented the idea.
    Wikipedia and a clutch of other sites assured me it was Chris
    Messina. How predictable, I thought. Twitter hashtag invented by a
    high-profile, privileged dude with connections galore. But my life
    experience told me that *well-connected dudes with high public
    profiles are much better associated with taking credit for inventions
    than actually inventing them*. So I decided to check out the story on
    Twitter itself.

    Because of Twitter's chronological integrity and the fact that I
    could restrict the period of investigation to a time before Messina's
    claim, I was able to establish that Messina did not in fact invent
    the Twitter hashtag. I wrote a post documenting the truth back in
    2016. Sadly, it's been one of the least visited posts I've ever
    written. The search engines are quite happy with wall to wall
    regurgitations of the Wikipedia line. But the post does demonstrate
    how much more accurate Twitter can be as an information source than a
    typical web search engine. And whilst single Tweets are limited (by
    character-count) in their ability to elaborate on a story,
    collectively they can prove extremely thorough in the picture they
    provide.

    <https://twirpz.tumblr.com/post/676456221578035200/
    twitters-real-hashtag-pioneers>

    Twitter also affords us a directional filter on information. By
    default, we only really see what influential voices are saying. But
    we can filter a Twitter Advanced Search to show only the replies TO
    those influential voices. That directional filter can serve as an
    ideological filter and quickly take us to the opposing views which a
    web search engine can easily hide.

    This works brilliantly where marketing or propaganda is strong. For
    example, a brand is only ever going to tell you what it gets right.
    Never what it gets wrong. The brand will typically use SEO strategies
    with web search engines, to ensure that its official messaging
    occupies the whole front page, and that the more negative feedback is
    buried under a continuous spew of marketing. But using Twitter Search
    we can completely filter out the brand's own messaging and search
    only the replies to it. This gives a much truer picture of the
    brand's performance, and we additionally get to see whether the brand
    addresses issues raised by members of the public, or simply ignores
    them.
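
    As a rough illustration of the replies-only filter, the sketch
    below combines the `to:` operator (replies sent to an account) with
    `-from:` (exclude the account's own posts). The account name
    "ExampleBrand" and the function name are hypothetical, and the URL
    layout is an assumption that may change.

```python
from urllib.parse import urlencode


def replies_to_url(account: str, terms: str = "") -> str:
    """Search replies sent TO an account, minus the account's own posts."""
    handle = account.lstrip("@")
    # `to:` matches replies addressed to the account;
    # `-from:` filters out anything the account itself posted.
    parts = [f"to:{handle}", f"-from:{handle}"]
    if terms:
        parts.append(terms)
    query = " ".join(parts)
    return "https://twitter.com/search?" + urlencode({"q": query, "f": "live"})


# "ExampleBrand" is a placeholder, not a real account reference.
print(replies_to_url("@ExampleBrand", "refund"))
```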

    <https://popzazzle.blogspot.com/2022/01/
    why-fact-checkers-need-to-fact-right-off.html>

    <https://popzazzle.blogspot.com/2020/09/
    content-marketing-on-low-budget.html>

    CUSTOMISED SEARCH
    =================

    Instances of the decentralised search engine Searx (listed here--page
    requires JavaScript) are often recommended as an alternative to
    bigger web search engines. But it's rarely explained how the search
    capabilities offered by Searx can be rigorously customised to focus
    on the best sources of information for a given subject.

    <https://searx.space/>

    Searx is all about metasearch. That is, compiling results from a
    variety of different search indexes. But with Searx, you can choose
    which indexes you want to query. If you've explored and tested
    various instances of Searx, you've probably noticed that the search
    results can be vastly different from one instance to the next. That's
    because each one is set up by its administrator to query a different
    selection of indexes. But the range of sources a Searx instance
    queries is also open to user-customisation. By going into the
    *Preferences*, you can define exactly whose results you want, and
    whose you don't.

    I'll use Searx Belgium as an example, because I've found it to be
    reliable. There are tabs along the top of the results page that
    denote categories of search. Once you've entered a search term and
    have a results list on screen, you'll see that the results list is
    headed with horizontal selection options such as General, Images,
    Videos, News, etc. Unlike with Google, you can simultaneously choose
    as many or as few of these search categories as you like. Just select
    the tab or tabs you want and then re-click the Start Search button.

    <https://searx.be/>

    Let's say you de-selected the *General* tab--which is selected by
    default--and instead selected the *Social Media* tab. You'll see a
    dramatic change in the results. Rather than being sourced from
    Google, Wikipedia, etc (which are Searx Belgium's default sources for
    General search), the results are now solely coming from Reddit (which
    is Searx Belgium's default source for Social Media).

    I really like having the option to get a selection of results solely
    from Reddit, because community Q&A discussion is broadly a lot more
    genuine than the output of some listicle merchant whose real goal is
    not to help you solve a problem, but to pocket some commission from
    Amazon. Even if the contributors on Reddit are not experts (and
    sometimes they are), collectively they're likely to get you closer to
    a real solution than an expert blogger who isn't even trying to help.

    True, we could confine Google or DuckDuckGo search results to Reddit
    by prefixing our search term with *site:reddit.com*--and this is one
    of the only really reliable techniques left for filtering out the
    annoying spam on major web search engines. But we've come to expect
    greater convenience than having to type a website domain into a
    search box, and that's what the tab system on Searx gives us.
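
    For anyone who does want the `site:` route without retyping the
    domain each time, a small helper can do the prefixing. This is only
    a sketch: the function name and the example search terms are
    invented, and DuckDuckGo is used simply because its query URL is
    short.

```python
from urllib.parse import urlencode


def site_restricted_url(domain: str, terms: str) -> str:
    """Build a DuckDuckGo URL whose results are confined to one domain."""
    # The `site:` operator limits results to pages on `domain`.
    query = f"site:{domain} {terms}"
    return "https://duckduckgo.com/?" + urlencode({"q": query})


print(site_restricted_url("reddit.com", "fix wifi dropouts"))
```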

    Out of the box, the Searx instance in our example already offers some
    easy ways to customise the search results for specific needs. But by
    pitching into the *Preferences*, we can further tailor the sources
    for each of those category tabs. For example, we could restrict the
    image search sources solely to Unsplash, or Flickr. Then we filter
    out all of the news site spam and very predominantly find photography
    enthusiasts instead.
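
    The same restriction can usually be expressed in the search URL
    itself, which is handy for bookmarking a pre-filtered search. The
    sketch below assumes the instance accepts GET requests on `/search`
    with `categories` and `engines` parameters, and that it has engines
    named "unsplash" and "flickr" enabled. Engine names and availability
    vary between instances, so treat these values as placeholders.

```python
from urllib.parse import urlencode


def searx_url(instance: str, terms: str, categories=(), engines=()) -> str:
    """Build a Searx search URL limited to chosen categories/engines."""
    params = {"q": terms}
    if categories:
        # Comma-separated category names, e.g. "images" or "social media".
        params["categories"] = ",".join(categories)
    if engines:
        # Comma-separated engine names as listed in the Preferences page.
        params["engines"] = ",".join(engines)
    return instance.rstrip("/") + "/search?" + urlencode(params)


photo_search = searx_url(
    "https://searx.be/",
    "lighthouse at dusk",
    categories=["images"],
    engines=["unsplash", "flickr"],  # assumed engine names
)
print(photo_search)
```

    Unlike the Preferences route, a URL built this way does not depend
    on cookies, so it survives a cleared browser profile.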

    Independence from major web search is something we can, and should,
    try to build progressively.

    Incidentally, if you do make any changes in Searx *Preferences*,
    don't forget to scroll down and hit the *Save* button at the bottom
    of the page. Otherwise your changes won't register. You'll also need
    to have cookies enabled for the browser to remember your prefs.

    The next step up from here is spinning up your own Searx instance.
    This requires the use of a server (although it's included in the
    pre-packaged installation options if you use FreedomBox). It does,
    however, afford you an even more detailed realm of customisation. Not
    everyone will go this far, but the option is there for those who want
    to take it to another level.

    <https://www.freedombox.org/>

    One of the other great benefits of the Searx *Preferences* page,
    beyond simply changing the searchable indexes, is that it illustrates
    just how many different search resources there are, and names them so
    we can investigate them in their own right. For instance, you might
    spot Wiby among the General search options. What's Wiby?...

    UNDERGROUND SEARCH ENGINES
    ==========================

    Wiby sits alongside Marginalia, representing a budding breed of search
    engines that shun the modern internet and focus on the more simple
    and imaginative web of yesteryear. A time of enthusiasm, as opposed
    to pathological obsession with revenue. These obscure search engines
    are incredibly refreshing to use, because they deliberately punish
    the exact cash-crazed ideology that Google goes out of its way to
    reward. Within moments, the offbeat output from these underground
    resources illustrates just how tiresomely predictable Google Search
    and its derivatives have become.

    <https://wiby.me/>

    <https://search.marginalia.nu/>


    That small operators can build these products with limited indexes,
    and serve results which wake us up in a way that the mighty,
    multi-$billion Google has long since ceased doing, attests to a stark
    reality... Google no longer wants to stimulate us mentally. It just
    wants to haul us into a commercial brainwashing system and fire off
    its bullshit-ass lab-ratting schemes in every last corner of our
    itinerary.

    If you've recently tried to use a major web search engine to find
    original, detailed, historical web analysis published in the 1990s or
    early 2000s, you'll know how deeply frustrating it can be to solidly
    encounter 500-word SEO spins that some half-assed journalist wrote on
    a news site in 2020 or 2021. This is where underground engines like
    Marginalia and Wiby really come into their own. If you want to know
    what people were writing about Windows 98, *in 1998*, the best chance
    you have of achieving that with a minimum of hacks and advanced
    workarounds, is with a search engine like Marginalia or Wiby.

    USING THE WIKIPEDIA CITATIONS LISTS AS SEARCH RESULTS
    =====================================================

    Another clever way to access high quality, vetted resources with zero
    spam, and zero advertising, is to employ a two-step process in which
    the Wikipedia citations lists serve as sets of search results. Search
    your query on Wikipedia, click through to the relevant page, then
    scroll straight to the bottom and review the References, Sources or
    External Links sections. Whilst a lot of entries will be hard copy
    books or links to other pages on Wikipedia itself, there are usually
    some links to definitive posts on other websites.
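
    The scroll-to-the-bottom step can also be scripted. A minimal
    sketch, assuming the MediaWiki `action=parse` API with
    `prop=externallinks` returns a page's outbound links as a JSON
    list: one function builds the request URL, the other keeps only
    off-site links. The sample response is invented for illustration
    and no network request is made.

```python
from urllib.parse import urlencode, urlparse

API = "https://en.wikipedia.org/w/api.php"


def external_links_request(page_title: str) -> str:
    """URL asking the MediaWiki API for a page's external links."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "externallinks",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def offsite_links(api_response: dict) -> list[str]:
    """Keep links that leave Wikipedia/Wikimedia, dropping duplicates."""
    links = api_response.get("parse", {}).get("externallinks", [])
    kept = []
    for link in links:
        host = urlparse(link).netloc.lower()
        if host.endswith(("wikipedia.org", "wikimedia.org")):
            continue  # still inside the Wikimedia family of sites
        if link not in kept:
            kept.append(link)
    return kept


# Invented sample shaped like the API's JSON reply.
sample = {
    "parse": {
        "title": "Example",
        "externallinks": [
            "https://example.org/a",
            "https://commons.wikimedia.org/wiki/File:X.jpg",
            "https://example.org/a",
            "https://en.wikipedia.org/wiki/Y",
        ],
    }
}
print(external_links_request("Hashtag"))
print(offsite_links(sample))
```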

    Wikipedia operates in a parasitic manner, taking information from
    everywhere, whilst using "nofollow" link attributes to low-key strip
    its sources of validity in the eyes of Google. So what tends to
    happen is that the Wiki rises higher and higher in the search
    results, while the visibility of the sources steadily declines. This
    means that many of the sources cited in the Wikipedia References
    lists, even though extremely high quality, are no longer prominent on
    major web search engines. A perfect illustration of the problems with
    search engines, as well as the Machiavellian behaviour of Big
    Tech--of which Wikimedia is a component.

    You can create your own small search engine just by scaling up your
    bookmark collection. I would recommend this to anyone.

    <https://backlit.neocities.org/
    is-the-downfall-of-cloudflare-nigh.html>

    But we can use the Reference lists themselves as valuable pointers to
    quality sources, which lead us to real experts who can give us deeper
    insight. In general knowledge fields, this method can be a lot more
    productive (and certainly more reliable) than merely querying a major
    web search engine. In using this method and visiting original, high
    quality source sites, you're also helping to reward the people who
    have been screwed over by Wikimedia and Google.

    COLLECTIVE STRENGTH
    ===================

    So, our search bookmarks now combine Wikipedia Search, Twitter
    Advanced Search, a customised Searx instance or two, and the
    retro-focused Wiby and/or Marginalia. This collective base gives us
    better access to truthful, useful and insightful information than
    we'll get from hopefully banging queries into Google, Bing or
    DuckDuckGo. And importantly it also helps free us from much of the
    timewasting irrelevance that formerly dominated our search results.
    Simply, we see the word "Amazon" a heck of a lot less, and if you're
    anything like me, that's a lifestyle-improvement in itself.

    There will still be times when we need a more mainstream engine, but
    the mainstream engines are now too overwhelmed with what Google used
    to rank down as "webspam", to serve as a first resort. Now that
    Google positively loves and encourages "webspam", and instead spends
    its time ranking down sites that don't give friggin' Tag Manager
    enough gainful employment, it's no longer about the consumer. It's
    50% an elitist closed shop in which Amazon, eBay, YouTube and Co. win
    by default, and 50% a "which established e-corp can bribe the most
    PR7s and pump the most elaborate data graph into Silicon Valley?"
    contest.

    It's hard to stop relying on major search engines, because the one
    advantage they still do have is convenience. Being able to search
    everything from one place becomes a habit, and it's a hard habit to
    break. But we have to move on from a reliance on search engines like
    Google before things get even worse.

    BUILDING A BANK OF NICHE RESOURCES
    ==================================

    Independence from major web search can also be built progressively.
    If we make a point of looking for search facilities on sites that do
    provide us with good value, and then use those searches directly in
    future, we steadily reduce our reliance on the ad machine.

    For most people, web search engines don't really serve that many
    specific purposes. So whilst we might imagine having to build a very
    long list of niche resources in order to replace something like
    Google, in fact, a relatively limited number of entries will cover
    most of the ground.

    As a writer, one of my common queries is a synonym search. I became
    aware that I was searching for synonyms a lot, and that the sites I
    ended up visiting often gave poor matches, or had a grim
    user-experience. So the next time I found a site that gave me a good
    user experience and useful synonyms, I bookmarked the
    site--WordHippo--and used their internal search instead of constantly
    searching the whole web. Much quicker, no wading through ecommerce
    entries, and it's done the job.

    BOOKMARKING - A MEASURE OF THE FAILED SEARCH
    ============================================

    Through the twenty-tens I realised I was bookmarking more and more
    URLs, and I can see today that I do it obsessively. We've reached a
    point where we simply can't trust a search engine to find a useful
    post again next week, so anything at all that we have serious
    intentions of revisiting, we realistically need to bookmark.

    But well-categorised bookmarking is another compound escape route
    from major web search engines. Often, we know exactly which site we
    want to go to, but we don't remember the URL or the precise domain
    name. Is it .com or .org? .net or .me? Or .co.uk?... Many of us just
    tap the site name into a search engine and hit the link in the
    results. And even then we don't necessarily make a mental note of the
    domain name. We just keep repeating the same behaviour. Running that
    site name through a search engine every time we want to visit. This
    might happen twenty times, forty times, or more. So just by adding
    one bookmark to a browser, we might save ourselves scores of web
    searches. It all adds up.
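
    A well-categorised bookmark list is easy to search locally. The
    sketch below is one possible shape for that, not a description of
    any particular browser's bookmark store: the `Bookmark` class, the
    tags, and the matching rule (every query word must appear somewhere
    in the title, URL or tags) are all assumptions made for
    illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    title: str
    url: str
    tags: set[str] = field(default_factory=set)


def search_bookmarks(bookmarks: list[Bookmark], query: str) -> list[Bookmark]:
    """Return bookmarks whose title, URL or tags contain every query word."""
    words = query.lower().split()
    hits = []
    for bm in bookmarks:
        haystack = " ".join([bm.title, bm.url, *bm.tags]).lower()
        if all(word in haystack for word in words):
            hits.append(bm)
    return hits


# URLs taken from this article; the tags are made up.
BOOKMARKS = [
    Bookmark("Wiby", "https://wiby.me/", {"search", "retro"}),
    Bookmark("Marginalia Search", "https://search.marginalia.nu/", {"retro"}),
    Bookmark("Linux Mint", "https://linuxmint.com/", {"linux", "os"}),
]

for hit in search_bookmarks(BOOKMARKS, "retro search"):
    print(hit.title, hit.url)
```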

    The general rise in bookmarking is a measure of how little confidence
    we now have in web search. If you run a website, you're probably
    seeing a lot more visits from bookmarking resources than you did even
    just two or three years back. But I was struck a couple of weeks back
    by the lengths to which I was prepared to go in order to use a
    bookmark rather than a web search...

    Recently I switched my day-to-day operating system setup from Bodhi
    Linux with a dual-booting Windows partition, to a standalone Linux
    Mint installation. I left the Bodhi/Windows hard drive in the PC, but
    disconnected it, then fitted a new disk, and installed Mint. Wicked.
    Perfectly happy with Mint - no desire to get back into Windows...
    Until I wanted to find an answer to a tech query that I'd seen on
    Reddit.

    <https://www.bodhilinux.com/>

    <https://linuxmint.com/>

    I tried two or three web searches and could see I was getting
    nowhere. I could have pitched into the usual cat and mouse game of
    trial and error, using strategic quotes, increasingly long tails,
    etc. But the thought of all that was actually so gruelling that I
    instead switched off the computer, unscrewed the casing, connected up
    the other disk, rebooted, and went into Windows, where I knew I'd
    saved the bookmark. We think of search engines as a convenience, but
    if dismantling a computer is more convenient than completing what
    should be a straightforward search process, we really do need to
    think again.

    SEARCH CRISIS
    =============

    The creeping perception that web search is not serving our needs is
    as much a crisis for search engines as it is for us. To date, it
    hasn't dented the top search engines' profitability, because Google
    just blasts more and more Big Tech real estate into the results,
    making the elitist cartel more and more money per query. But there's
    only so far that can go. If it reaches the point where the public can
    reliably predict which sites they're going to find in the search
    results, there is no longer any point in them using a web search
    engine.

    We're already well down that road, and for Google, there's no turning
    back. If Google boots its own, its partners', its lobbying pals'
    (raise your hand Wikimedia), and its corporate supplicants' domains
    out of the results to allow the wider web back into the picture and
    restore public faith, its profits are going to bomb. Google now
    relies on corrupt search algorithms to hit its financial targets, so
    the only question is how much of that increasingly unsightly road
    there is left to travel before people begin jumping off the cart in
    volume.

    And at present? Well, it isn't that people don't realise how bad the
    search results now are. They absolutely do. In research on Twitter, I
    found a broad recognition that today's search results are worse than
    they used to be. It's just that people blame publishers, and not the
    search engines, for the decline.

    But the thing is, the pages of yesteryear have not gone anywhere.
    Wonderfully entertaining sites I became aware of in the late 1990s
    are still up, still rigorously maintained and updated with high
    quality writing. But they never show up today in the search results.
    If I hadn't found out about them years ago, I wouldn't know they
    existed. And if you Google the names of their admins - the
    internet-famous of the AltaVista era - the results are topped not by
    their excellent sites, but by Facebook and LinkedIn accounts. Some
    belonging to randoms with the same name. So this is not a decline in
    publishing standards. This is a conspiracy. And conspirators rarely
    fool the public forever.

    From: <https://backlit.neocities.org/how-to-find-the-lost-web>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From jerk-o@jerk_o2002@yahoo.com to comp.misc on Tue Jul 8 13:18:05 2025
    From Newsgroup: comp.misc

    On Tue, 8 Jul 2025 14:11:46 -0000 (UTC), Ben Collver
    <bencollver@tilde.pink> wrote
    UNDERGROUND SEARCH ENGINES
    ==========================

    Wiby sits aside Marginalia, representing a budding breed of search
    engines that shun the modern internet and focus on the more simple
    and imaginative web of yesteryear. A time of enthusiasm, as opposed
    to pathological obsession with revenue. These obscure search engines
    are incredibly refreshing to use, because they deliberately punish
    the exact cash-crazed ideology that Google goes out of its way to
    reward. Within moments, the offbeat output from these underground
    resources illustrates just how tiresomely predictable Google Search
    and its derivatives have become.

    <https://wiby.me/>

    <https://search.marginalia.nu/>

    Another one would be <https://yacy.net/>
  • From candycanearter07@candycanearter07@candycanearter07.nomail.afraid to comp.misc on Wed Jul 9 19:20:03 2025
    From Newsgroup: comp.misc

    Ben Collver <bencollver@tilde.pink> wrote at 14:11 this Tuesday (GMT):
    [snip]

    FOR ACCURATE INFO, TWITTER SEARCH IS NOW MORE USEFUL THAN GOOGLE
    ================================================================

    It's almost incomprehensible that the worldbeating sophistication of
    Google Search could regress so far as to allow a micro-blogging site
    to provide more relevant information, but that's where we are. And
    one of the main reasons Twitter Search has become more popular than
    Google with many people who research for a living, is the platform's
    rigid protection of chronological integrity.

    There are three components to this...

    * All Tweets are dated and uneditable.

    * Twitter Search allows us to define a date range.

    * One of the best ways to find a relevant search result is to filter
    out spam, and spam tends to come in waves, which are based on
    trends and current affairs. In other words, a reliable date filter
    can serve as a reliable spam filter.

    [snip]

    Twitter has its own problems, but I'll keep that in mind. Does anyone
    know if it's possible to search on Twitter without an account?

    From: <https://backlit.neocities.org/how-to-find-the-lost-web>
    --
    user <candycane> is generated from /dev/urandom