Newsgroups: comp.sys.cbm
From: csbruce@ccnga.uwaterloo.ca (Craig Bruce)
Subject: Re: Commodore 128 Multitasking OS (LONG)
Date: Tue, 19 Sep 1995 23:50:52 GMT
Organization: University of Waterloo, Canada (eh!)

hermit@cats.UCSC.EDU (William R. Ward) writes:

>                            Multitasking
>
>The multitasking kernel (BOS) that Craig Bruce wrote is very
>interesting.  See http://ccnga.uwaterloo.ca/~csbruce/bos2.txt for
>info.  I am somewhat interested in hearing more from Craig about
>future plans for I/O and networking, and how this might relate to or
>be compatible with ACE (if at all).

My kernel is a distributed microkernel, and hence the networking will
be built right into the kernel, but most of the I/O will be implemented
as user-level process servers.  Accessing the various system services
will be very different from ACE, since messages will be required to
reach the (potentially remote) servers.  But the Application Program
Interface will be a set of library calls compiled into each program,
presenting an interface that is identical to the ACE API, with maybe
some minor exceptions.  All of the ACE applications will be ported to
BOS with minor or no modifications.

>Craig Bruce's kernel switches tasks once a jiffy, approximately.  The
>VIC raster interrupt is used to implement this.  I think that's a good
>way of doing it, but would like to consider the options.  There are
>three ways to interrupt on the 6502 family; IRQ (the VIC raster
>interrupt), NMI (the RESTORE key), and the BRK instruction.  The
>set/disable interrupt bit in the status register only affects IRQ and
>BRK, not NMI (hence non-maskable!).

The interrupt-disable bit in the status register does not mask out
BRKs; BRK is more of a "trap" than an "interrupt".

>I seem to recall that the CIA chip can generate an NMI at regular
>intervals; this may be a good way to invoke the swapper, no?  Care
>would have to be taken that an NMI doesn't occur while one is already
>in progress, but otherwise it would allow the use of IRQ for
>interrupts that can only happen when not in kernel mode (assuming the
>kernel disables interrupts while not running user processes).  Is this
>ability worth the hassle of using the less convenient NMI rather than
>IRQ?

I think that using periodic IRQs is the only sane way to do this.
Using NMIs gives no real advantage and means that all of the critical
sections in your code can be interrupted.  Good luck getting that to
fly.

>                     Memory and Stack Management
>
>I would like to learn more about how the 128's MMU relocates the stack
>and zeropage.  What restrictions are there?  How does it confuse the
>rest of the pages of memory?  For instance, if you relocate the zero
>page to $1100 and the stack to $1200, do the "real" zero page and
>stack appear at those addresses, or a "copy" of the relocated pages?
>Craig, would you mind explaining how the MMU does this, since you use
>this in your BOS?

Relocating the stack and zero page is very simple and is explained
fully in the Commodore-128 Programmer's Reference Guide.  You just poke
the page number into the appropriate MMU register, and poof, all
references to $0000-$00ff and $0100-$01ff are headed off at the pass
and redirected to the selected pages.
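For example, putting the zero page at $1100 and the stack at $1200 (the
pages from the question above) comes down to a few pokes.  This is just
an illustrative sketch, not BOS code; if I remember the PRG correctly,
the high pointer register should be written before the low one, since
the write to the low register is what makes the new value take effect:

p0l     = $d507       ; MMU zero-page pointer, low  (page number)
p0h     = $d508       ; MMU zero-page pointer, high (RAM block)
p1l     = $d509       ; MMU stack-page pointer, low  (page number)
p1h     = $d50a       ; MMU stack-page pointer, high (RAM block)

        lda #$00      ; keep the relocated pages in RAM0
        sta p0h
        lda #$11
        sta p0l       ; zero page now lives at $1100
        lda #$00
        sta p1h
        lda #$12
        sta p1l       ; stack now lives at $1200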
You can also use the absolute addresses of the relocated pages to
access their contents, though there is no reason to do this.  The
*real* zero page (the one at absolute address $0000 of RAM0) becomes
de-selected when you use a different page for zero page, and you can no
longer access the absolute one until you select it again.  It's very
simple.

But there is one restriction, I think.  If you have selected any
"common" memory between the internal banks (also described in the C128
PRG), then your relocated pages can only reside in RAM0 memory.  This
isn't really a big restriction unless you want to have way more
processes than you probably have internal memory for.

>I think because of the 128's MMU, more memory, and 2MHz mode it would
>be much more reasonable to do a multitasking OS on the 128 than the
>64.  There aren't as many 128 users out there, but there are still
>quite a few.  Thus I have renamed the subject line to be 128-specific.

IMHO, the C128's MMU and extra memory make multitasking reasonable.
Anything less, and you're going to have too much context-switching
overhead and too little space for application programs (when you have
your kernel and device drivers all loaded).

>Jonas suggested a means of doing stack management by using only the 1
>page for the stacks, where each process requests n bytes of stack
>space, and the kernel allocates chunks of stack accordingly.  When
>stack space is needed for a new task and there is not enough
>unallocated stack space available, the stack space of a process will
>be "swapped" out.  This is innovative but if the MMU can give a
>separate stack page for each process that would be preferable.  A
>multitasking OS for the C64 might use this approach however.  Its main
>disadvantage is more overhead for swapping and predicting stack usage
>for each process.  Also it is a lot easier for processes to overrun
>other processes' stack areas.

You don't need this on the C128, since the MMU can efficiently provide
an independent 256-byte stack to each process, for the context-switching
cost of changing the contents of one MMU register.  For the C64, it may
be a reasonable way of doing things.  Of course, there is still the
problem of two processes that want to use the same address range for
two different swapped-in versions of the stack.

>Tommy suggested methods for allocating zeropage addresses.  The
>program loader would allocate specific zeropage addresses for each
>process, and modify the code much like the relocatable code algorithms
>mentioned later in this article.  One problem with this is it prevents
>programs from being reentrant (using the same binary image for
>multiple processes works fine as long as (a) it is not self modifying
>and (b) the data space is separate for each process).  However since
>the 128's MMU allows each process to have its own zeropage we
>shouldn't need to deal with it, much like Jonas's stack algorithm.
>Again, good for a 64 multitasking OS, but the 128 can do better.

The MMU obviates this problem on the C128.

>Tommy suggested using FORTH-style blocks for virtual memory.  I will
>need to read up on FORTH before I can comment on that, unless someone
>here would care to give some thoughts on that.

Might be an interesting thing to try, if you're really cramped for real
memory.  Of course, performance can very easily go down the toilet, and
you need random access to disk storage.  The ACE (BOS) far-memory
mechanism is capable of implementing virtual memory in a similar way.
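To make the per-process-stack point above concrete, here is a minimal
sketch of the core of such a switch.  This is not actual BOS code; the
variable names, addresses, and table layout are made up for
illustration, interrupts are assumed to be disabled already, and all
stack pages are assumed to live in RAM0 (so the P1H register can be
left alone):

mmu_p1l       = $d509     ; MMU page-one pointer (stack page number)

cur_proc      = $0af0     ; (made up) index of the running process
proc_sp       = $0b00     ; (made up) saved stack pointer, per process
proc_stackpg  = $0b40     ; (made up) private stack page, per process

switch_to                 ; call with Y = index of the process to run
        tsx               ; save the outgoing process's stack pointer
        txa
        ldx cur_proc
        sta proc_sp,x

        sty cur_proc      ; the incoming process becomes current
        lda proc_stackpg,y
        sta mmu_p1l       ; page one now maps to its private stack
        lda proc_sp,y     ; restore its saved stack pointer
        tax
        txs

        rts               ; "returns" into the incoming process, via
                          ; the return address saved on its own stack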
>                         Relocatable Binaries
>
>When I originally started this thread I knew nothing about relocatable
>binaries, but others have posted much detail lately in connection to
>both this and ACE.  What's basically needed is for the assembler to
>produce a "header" that the program loader can use to shuffle all the
>internal byte references.

A long time ago, I posted an article describing what I intended to do
about relocatable binaries for BOS.  Here it is again:

  Article: 34516 of comp.sys.cbm
  From: csbruce@ccnga.uwaterloo.ca (Craig Bruce)
  Subject: Re: Thoughts on BOS
  Organization: University of Waterloo, Canada (eh!)
  Date: Tue, 7 Mar 1995 02:41:22 GMT

1. EXECUTABLE-FILE FORMAT

The way that I am planning to do it is to have a special format for
executable programs in BOS.  The executable format would have three
sections: the header, the raw program image, and a
relocatable-references section.  The format would look something like
the following:

OFFSET  SIZE  DESCRIPTION
------  ----  -----------
   0      3   magic number for executables
   3      1   flag: executable or object module
   4      2   size of raw code image
   6      2   raw code start address
   8      2   raw address of entry point
  10      2   size of "bss" (uninitialized storage) section
  12      4   size of relocatable-references section
  16      4   size of exported-symbol section
  20      4   size of unresolved-symbol section
  24      4   size of unresolved-expression section
  28      8   reserved for future expansion
  36      x   raw code image, followed by the variable-length
              sections listed above

This format will also double as the object-module format, which will be
a superset of the executable format (since there is more information to
keep track of).  The "magic number" field would tell the type of the
file (and prevent you from accidentally executing something that isn't
a program) and the "flag" field would distinguish between an executable
file and an object module.  An object module would not be executable
because it wouldn't be a "complete" program.  An object module would
require a number of other sections to be appended to the file format,
which are mentioned in the header but not included in an executable
file.

The "raw code image size" field tells the number of bytes in the "raw
code image" section of the file, which contains simply the raw image of
the program, assembled to execute at some address given by the "raw
code start address" field (it's not really necessary to assemble the
program from address 0; in fact, this can cause some confusion in the
assembler in deciding between absolute and zero-page addressing modes).
The "raw address of entry point" gives the point to start execution in
the program image, in terms of the raw addresses used in the code.

The "size of bss" field tells how many bytes of uninitialized variable
space the program requires.  The bss is logically part of the raw code
image, but it is not included there because it would be wasteful to
keep its bytes in the executable file since they are supposed to be
"just random" (or all zeroes if you're talking about a Unix system).

The "size of relocatable references" field tells how many bytes are in
the "relocatable-references section" in the file.  The format of this
section can be:

OFFSET  SIZE  DESCRIPTION
------  ----  -----------
   0      2   number of full-address relocations
   2      2   number of low-byte relocations
   4      2   number of high-byte relocations
   6      x   addresses of the full-address relocations
   x      x   addresses of the low-byte relocations
   x      x   addresses of the high-byte relocations

There are three things that need to be relocated: full addresses (like
JSR SUB), low bytes (like LDA #<TABLE_ADDR), and high bytes (like
LDA #>TABLE_ADDR).
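For instance, in a made-up fragment (generic assembler syntax; the
labels and the $1000 start address are invented purely for
illustration), the entries that would end up in each list are:

        org $1000       ; the "raw code start address" in the header

start   jsr sub         ; full address: the word at $1001-$1002 gets
                        ;   a full-address relocation entry
        lda #<table     ; low byte: the byte at $1004 gets a
                        ;   low-byte relocation entry
        ldx #>table     ; high byte: the byte at $1006 gets a
                        ;   high-byte relocation entry
        rts

sub     rts             ; some routine inside the code range
table   db  1,2,3       ; some data inside the code range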
The first three fields give the number of each type of relocation that
has to be done, and the next three fields give the addresses in the raw
code range where these relocations need to be applied.  Each address in
this section takes two bytes.  For a really big and complicated
program, this section could exceed 64K in size, so a 32-bit value for
this section's size is given in the file header.

For the three example instructions given here, the relocation address
would always be the second byte of the instruction.  For assembler
directives like:

tableA: db  <code1, <code2, <code3
tableB: db  >code1, >code2, >code3
tableW: dw  code1, code2, code3

where the "codeN"s are all addresses in the program range, the
relocation addresses would all be where the specified bytes and words
get put into memory.

Automatically coming up with the relocation addresses requires inside
information that is available only at the symbolic level of an
assembler, so generating this relocation information has to be built
into an assembler.  The assembler has to keep track of which values are
addresses that are inside of the code range and which operands are not.
The assembler also has to carefully evaluate expressions like:

        lda  #>code3-code2
        ldx  #<code3-code2+code1
        ldy  #>code3-code2+code1

(Note: the "<" and ">" initial operators here are low-precedence
extract-low-byte and extract-high-byte operators, respectively.)  The
operand for the LDA instruction is not an address high-byte; it's a
constant!  Two addresses within the same code section will have a
constant difference.  The LDX example IS a relocatable reference, and
so is the LDY (the distance between code3 and code2 is constant, and a
constant plus a relocatable reference gives a relocatable reference).

Fortunately, such an assembler exists.  Well, almost.  It's called "the
ACE assembler".  It doesn't fully keep track of addresses now, nor
create relocatable executable files, but it will.  It has the basic
ingredients for doing this already implemented.  And, even more
important, I am on the inside track with respect to how it works and
how to change it, since I created it.

There are, of course, expressions that can't sensibly be resolved,
like:

        code1 * code2
        code1 + ...

>                           Reentrant Code
>
>If you want to run two copies of a particular process at the same
>time, it would be best if you didn't have to load two identical copies
>of it into memory, no?  Well if the programmer is careful and the OS
>is designed correctly this is possible.

I would have a flag in the executable header that says whether the code
is re-entrant or not.

Keep on Hackin'!

-Craig Bruce     csbruce@ccnga.uwaterloo.ca

"Never question the relevance of truth,
 but always question the truth of relevance."