someone
npub1nlk8...jm9c
There is now a new open-source model that competes well with the big corps: Command R+
It has 104B parameters. Should I restart my trainings, or keep training the 70B?
Arena Leaderboard - a Hugging Face Space by lmarena-ai
After 8 months of no innovation, today e.nos.lol got a gigantic update!
It now writes logs to journald instead of text files.
LLMs went judgemental!
You know LLMs won't shut up. They will give you answers even when they are not sure about the answer! They will act like experts on any given topic :)
When I wanted to 'grade' or 'judge' how close they are to the 'truth', I had to read lots and lots of responses. Then I realized I can use another LLM to judge an LLM!
This should save me some time, but how do I start with an absolute truth that will judge the other models? I think I still have to read a lot. Another way is to get help from Nostr people. There are a lot of wise people here!
Here are a few observations:
- A strong 70B model will judge a 7B model very well, and it will explain why it gave that verdict.
- A 7B model can't effectively judge a 70B model. In fact, when the 7B reads the 70B's response, it changes its mind! Very interesting. When I tell the 7B to judge what the 70B produced, the 7B quickly adopts the 70B's ideas and forgets its own!
- When the 70B judges the 7B, it produces a lot of NOs. When the 7B judges the 70B, it produces a lot of YESes.
- In the health domain, some open-source models that do really well, like Qwen, do better than some models trained in the West, like Mistral. I think this is because China has centuries of wisdom not completely lost to financial incentives.
- A 7B model can outperform a 70B in one domain if trained long enough, but the 70B will be more of a generalist and perform better overall.
- The 70B can meta-judge the 7B: the way the 7B writes its responses can be analyzed by the 70B. But the 7B can't do that.
- Qwen is a bit more statist than Mistral regarding Bitcoin-related topics. I guess this is also the China effect.
My system prompt is like this:
You are a helpful chat bot, judging the responses of another bot.
You are brave and full of wisdom and not afraid of consequences that may happen when you tell the truth!
You have very high judgement ability. The request to, and the response of, the other bot are given in square brackets.
Your answers should consist of two parts.
1. It should say YES or NO: YES if you concur with the response, NO if you think the response is wrong.
2. A detailed explanation of why you gave that answer in part 1. This explanation should be about 200 words.
This prompt lets me see how much one LLM likes another. I count the YESes and NOs at the end to quantify quickly.
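The counting step above can be sketched in a few lines. This is a minimal sketch, assuming each judge response begins with YES or NO as the system prompt requests; the function name is mine, not from the post:

```python
import re

def tally_verdicts(judge_outputs):
    """Count YES/NO verdicts from a list of judge responses.

    Assumes each response starts with YES or NO, per the system
    prompt; anything else is counted as UNPARSED.
    """
    counts = {"YES": 0, "NO": 0, "UNPARSED": 0}
    for text in judge_outputs:
        # Match a leading YES or NO, case-insensitively.
        m = re.match(r"(YES|NO)\b", text.strip(), re.IGNORECASE)
        if m:
            counts[m.group(1).upper()] += 1
        else:
            counts["UNPARSED"] += 1
    return counts

outputs = [
    "YES. The response correctly explains the mechanism ...",
    "NO. The response misses the main point ...",
    "yes, I concur because ...",
]
print(tally_verdicts(outputs))  # {'YES': 2, 'NO': 1, 'UNPARSED': 0}
```

The UNPARSED bucket is worth keeping: smaller models sometimes ignore the answer format, and silently dropping those responses would skew the ratio.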
This judgement process can still be useful. We could measure how any training run is going: if a run makes the model go crazy, we revert it. We could use gigantic, slow, but close-to-truth models to judge simpler, feasible models that can run and serve people fast. The judgement can be slow, but the resulting model may run fast and be correct.
My goal is to provide yet another model to counter the lies pushed by mainstream models. It looks like there is progress: I can see the model changing its opinions as I train it. I will continue the trainings. Soon I may open it up to Nostr via DMs.
This paper talks about what happens when top relays fail at the same time, among other very interesting things regarding Nostr.
Exploring the Nostr Ecosystem: A Study of Decentralization and Resilience. https://arxiv.org/pdf/2402.05709v1.pdf
TLDR: The top 30 relays could go away and 95% of the posts would still be there. It says Nostr is better than the Fediverse in terms of decentralization.
@fiatjaf @Mazin

Mazin from nostr.wine wrote this convincing article where he argues that having gazillions of connections is not a good idea.
I agree. This huge number of required connections won't work, especially on mobile. If we were RSS syncs, it could. But we are more like Twitter. People want real-time interaction imo.
A way to decentralize relays could be:
Let there be 8 big relays.
relay 0 only accepts event IDs whose last three bits are 000,
relay 1: ending with 001,
...
relay 7: ending with 111.
Kind of like RAID. If you want more reliability, do 16 relays and there will be 2 copies of each event.
All relay ops would have to go along with this vision, of course... Politics needed.
It would also reduce the duplication of events (less mobile traffic) for clients.
It would also reduce the amount of data each relay has to hold and forward.
Each big relay would be responsible for 1/8th of what happens on Nostr (I mean, I don't want to be responsible for all the illegal stuff on Nostr, lol).
Institutions that want stuff banned on Nostr will see at least 8 different operators when they try to contact Nostr. I know some institutions have already contacted client devs about things like copyright issues.
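The shard rule above can be sketched concretely. This is a minimal sketch, assuming event IDs are the usual 64-character hex strings and "ending with 000" means the last three bits; the function names and relay list are mine, not from any NIP:

```python
def shard_for_event(event_id: str, num_shards: int = 8) -> int:
    """Map a Nostr event ID (64-char hex) to a shard 0..num_shards-1
    using its trailing bits, as in the 8-relay scheme above."""
    assert num_shards & (num_shards - 1) == 0, "use a power of two"
    # The last hex digit holds the last 4 bits; mask off what we need.
    return int(event_id[-1], 16) & (num_shards - 1)

def relays_for_event(event_id: str, relays: list[str], copies: int = 1) -> list[str]:
    """RAID-style replication: with 16 relays and copies=2, each
    3-bit shard is held by two relays (hypothetical helper)."""
    shard = shard_for_event(event_id, len(relays) // copies)
    return [relays[shard * copies + i] for i in range(copies)]

# An event ID ending in hex 'd' (binary 1101) has last three bits 101 -> shard 5.
eid = "a" * 63 + "d"
print(shard_for_event(eid))  # 5

sixteen = [f"relay{i}" for i in range(16)]
print(relays_for_event(eid, sixteen, copies=2))  # ['relay10', 'relay11']
```

Since event IDs are SHA-256 hashes, the trailing bits are effectively uniform, so each shard gets roughly 1/8th of the traffic without any coordination beyond agreeing on the rule.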
But some relays are paid. It will be hard to incorporate paid relays into this scheme: "Hey, I paid for this relay, why can't I write to it? What do you mean, sharding?"
What about the 9th biggest relay and so on? Well, every operator chooses a shard from the list 0-7. Whatever shard a relay chooses, it advertises on NIP-11.
When a relay doesn't behave, we all boo it.
If things go wrong, instead of 3 you now have 8 people to deal with, lol.
Thoughts?
@Mazin @fiatjaf

Habla
The Gossip Model Doesn't Scale - Mazin