Replies (17)
there are many C compilers
Yeah, I had a terrifying moment there. Found this, starting research now
https://dwheeler.com/trusting-trust/
Do coders actually do this? He's saying write your own compiler by hand and then bootstrap up to something big so you can trust it.
people who create languages are not normal
it would make more sense to write your compiler in another language and leave it like that, I don't see why not
This shit is fascinating.
that's because you never read a textbook on compiler design and construction. i did, when i was like, idk, 13 or something, i should try to remember exactly. borrowed a book using my mother's university library card.
never got around to it quite yet, but it's on my list to build a slimmed down version of Go, working title "moxie" (as in the soda with gentian based flavor sold in the USA). i've specced up the revised grammar, and mostly modified the yaegi go interpreter to run it.
i tried to build an actual binary compiler first, but that proved to be very complex and claude couldn't do it. but the interpreter was relatively easy, and once implemented, can then allow me to write the compiler in moxie.
the standard technique is to make a bootstrap compiler, which only implements the parts of the language that the compiler itself needs. then, with that compiler, you refactor the source a bit so that it uses the full language where you previously used more clumsy constructions (those were only there to keep the bootstrap compiler easy to build), and then you have the final compiler. i'm using the interpreter as bootstrap 1, and because it's just a small set of changes, and mostly removals, i will effectively be at stage two, the second bootstrap, already. with that i write the full compiler, and add the execution environment and ABI targets to the code generator.
Go has an extensive toolkit for building tools that edit go code, which makes up a large chunk of what the compiler actually has in it. first there is the tokeniser, which identifies all the reserved words, operators, etc, and generates a linear list of tokens. then there is an AST generator (abstract syntax tree) which produces the more complex, tree structured information about syntax and grammar: types, comment blocks, etc. after that, the compiler feeds the AST into the code generator, which can either generate another programming language (less common) or directly generate binary code for the processor.
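a minimal sketch of those first two stages, using the standard library packages go/scanner, go/parser and go/ast. the input program `src` and the helper names `tokens` and `funcNames` are made up for illustration; this only shows the tokenise-then-parse pipeline, not anything from moxie or the real Go compiler internals.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/scanner"
	"go/token"
)

// src is a made-up input program used only for this illustration.
const src = `package demo

func add(a, b int) int { return a + b }
`

// tokens (hypothetical helper) runs the standard library scanner over
// the source and returns the linear token stream as strings.
func tokens(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("demo.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)

	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if lit != "" {
			// Identifiers, keywords and literals carry their own text.
			out = append(out, lit)
		} else {
			// Operators and punctuation are described by the token itself.
			out = append(out, tok.String())
		}
	}
	return out
}

// funcNames (hypothetical helper) parses the source into an AST and
// walks the tree to collect the names of all function declarations.
func funcNames(src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(f, func(n ast.Node) bool {
		if fd, ok := n.(*ast.FuncDecl); ok {
			names = append(names, fd.Name.Name)
		}
		return true
	})
	return names, nil
}

func main() {
	fmt.Println(tokens(src))
	names, err := funcNames(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

a code generator would be the third stage, walking the same tree and emitting output instead of just collecting names.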
i hope you enjoyed that explanation, about why language compilers are usually written in the language they compile. once you have one, you can then use it to build updated versions with bugfixes and add/remove language features.
Yes, the rust compiler is written in rust too.
The thing is at the beginning, you need to write the compiler for your language in a language that already exists. It'll take your code files and turn them into machine code.
But say you then write C code that can compile C code (read C code files and convert them into machine code), and you compile it using your existing compiler written in Assembly or whatever.
Then you can run the C compiler you wrote in C to compile new C code.
How small could you get the first iteration of the bootstrappageness? Are you worried about possible Trojans in compilers that you're trusting?
I don't get what you gain from the bootstrapping process
it's a sequence of hacks
much better if you wrote your moxie compiler in go directly, why not?
Because a self referencing compiler could rebuild Trojans from the first iteration of the compiler without it being detectable. You run your sha sum check to see that it matches the signed version and it looks fine, but that's because no one knew there was malicious code in it from the previous compiler.
I'm probably saying this wrong.
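A rough sketch of the checksum point, assuming placeholder byte slices in place of real compiler binaries. The names `digest` and `sameBinary` are made up. A matching hash only shows that two files are identical; it says nothing about whether the file is clean. The page linked above describes diverse double-compiling, where the comparison is made against a binary rebuilt from source through a second, independent compiler, and that comparison is the part sketched here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// digest (hypothetical helper) returns the SHA-256 of b as lowercase hex,
// the same value a sha256sum style tool would print for that content.
func digest(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// sameBinary (hypothetical helper) reports whether two builds are
// bit-for-bit identical by comparing their SHA-256 digests.
func sameBinary(a, b []byte) bool {
	return sha256.Sum256(a) == sha256.Sum256(b)
}

func main() {
	// Placeholder contents standing in for real compiler binaries.
	shipped := []byte("compiler binary as distributed")
	rebuilt := []byte("compiler binary as distributed")
	tampered := []byte("compiler binary with something extra")

	fmt.Println(digest(shipped))
	fmt.Println(sameBinary(shipped, rebuilt))  // true: the builds agree
	fmt.Println(sameBinary(shipped, tampered)) // false: something differs
}
```

The check is only as useful as the independence of the second build path; comparing a binary against a copy of itself always passes.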
it's an interpreter, and i could probably read the whole codebase within a couple of hours. it's not something i take very seriously - trojan in a programming language, and definitely not if i can read the source. such things would have clear red flags, like non-commented, strange hexadecimal constants that look like they could be binary code. those don't even belong in a compiler, there is no purpose for this. for elliptic curve functions, those are always well documented elements of the arithmetic group, and all the endomorphisms and symmetries and all that.
yes, bootstrapping can be very simple. the core C syntax is an example of a language that is already so small that it's simple to build into itself. but C has some terrible elements in it, like unclear bit lengths that are often platform dependent (this was the hardest part when i ported a hamming code error correction algorithm from C into Go some time back), the union type is an abomination, the pointer dereference operator versus the dot operator for struct fields is confusing. all of these complexities lead to slow compilation, and harder validation of the correctness of the translation into machine (or other) language.
things you can leave out of Go that are not needed for a compiler are pretty extensive. channels, mutexes, atomics, goroutines, interfaces, probably you can get away with no maps by using inverted indexes or implementing a key/value index map purely with slices. you could leave out slices, too, but that would require a lot of fiddling to work around, as they are extremely handy for parsing streams of bytes. you could leave out strings, too, because they are immutable and weird (and that's one of the things i'm removing anyway), but of course you would still need a string literal, it just would map to a slice of bytes (ascii/utf-8).
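a minimal sketch of the "key/value map purely with slices" idea. the `entry` and `sliceMap` names are made up, lookup is a plain linear scan, and it uses ordinary Go strings for keys just to keep the example short (a byte-slice key would need a manual comparison). it's an assumption about what such a replacement could look like, not code from moxie.

```go
package main

import "fmt"

// entry is one key/value pair (hypothetical type for this sketch).
type entry struct {
	key   string
	value int
}

// sliceMap is a key/value table backed only by a slice. Lookups are a
// linear scan, which is fine for small tables such as a symbol table
// in a bootstrap compiler.
type sliceMap struct {
	entries []entry
}

// get returns the value stored under key and whether it was present.
func (m *sliceMap) get(key string) (int, bool) {
	for _, e := range m.entries {
		if e.key == key {
			return e.value, true
		}
	}
	return 0, false
}

// set overwrites the value for an existing key or appends a new entry.
func (m *sliceMap) set(key string, value int) {
	for i := range m.entries {
		if m.entries[i].key == key {
			m.entries[i].value = value
			return
		}
	}
	m.entries = append(m.entries, entry{key: key, value: value})
}

func main() {
	m := &sliceMap{}
	m.set("x", 1)
	m.set("y", 2)
	m.set("x", 3) // overwrites the first entry instead of appending

	v, ok := m.get("x")
	fmt.Println(v, ok) // 3 true
	_, ok = m.get("z")
	fmt.Println(ok) // false
}
```

keeping the entries sorted and using binary search would be the obvious next step if the tables got big.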
nah, it's a reasonable question but i am very familiar with how the parts all work, from many years, i have written tokenizers, and even a simple recursive tree structure lexical analyser that i played with to build a novel command line syntax with tree properties. at the time i was building an abstraction for a simple language, essentially.
there is no way to hide malware in a syntax, only a lexicon. i'm taking a language with one of the smallest lexicons, and razoring it down even further. probably quite a swathe of the standard library, i'm not even going to be using, but partially hand-translating it, mostly just filling the gaps in the parts where i have removed features that they used in the Go compiler.
so, yeah. it's not likely, it's a small search space to detect, and there are no common ways in which code can be backdoored.
LLMs on the other hand, can hide reams and reams of malware in them, but with the scope of the subject matter, there's nowhere it can transfer it forward.
Woosh! Sooooo many words I don't know. But it sounds like C can be small. Could you segment off different things and kinda make it modular, so that you add in a piece after you build it for additional capability? Those channels, mutexes, interfaces and maps? That way you could always return to the initial compiler and then redo the entire bootstrap.
That's reassuring.
Tiny C Compiler is 100 KB. I like it.
yeah, C is a small language. that's why it's so often used in kernels and drivers, it's actually a smaller lexicon than common assembler/instruction sets, it trades lexicon down for structuring.
actually, that's an example of why i love the Go loop syntax. it's one reserved word, and like 12 or more different grammars.
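a few of those forms side by side, as a sketch. this is not the full list, the function name `loopForms` is made up, and it sticks to forms that have been in Go for a long time.

```go
package main

import "fmt"

// loopForms (hypothetical name) runs several shapes of the for
// statement and returns one result per shape.
func loopForms() []int {
	var out []int

	// Bare for: loops until break.
	n := 0
	for {
		n++
		if n == 3 {
			break
		}
	}
	out = append(out, n)

	// Condition only, like while in other languages.
	i := 0
	for i < 4 {
		i++
	}
	out = append(out, i)

	// Classic three-clause form.
	sum := 0
	for j := 0; j < 5; j++ {
		sum += j
	}
	out = append(out, sum)

	// Range over a slice, with index and value.
	total := 0
	for idx, v := range []int{10, 20, 30} {
		total += idx + v
	}
	out = append(out, total)

	// Range over a string iterates runes, not bytes.
	runes := 0
	for range "h\u00e9llo" {
		runes++
	}
	out = append(out, runes)

	// Range over a map (iteration order is unspecified).
	count := 0
	for _, v := range map[string]int{"a": 1, "b": 2} {
		count += v
	}
	out = append(out, count)

	// Range over a channel until it is closed.
	ch := make(chan int, 2)
	ch <- 7
	ch <- 8
	close(ch)
	recv := 0
	for v := range ch {
		recv += v
	}
	out = append(out, recv)

	return out
}

func main() {
	fmt.Println(loopForms()) // [3 4 10 63 5 3 15]
}
```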
lol why do you write so fucking much