Some comments on virtual memory and paging in general:

Virtual memory is a means of making it appear that the computer has more memory than it actually does. It is implemented with a page table. Memory is divided into fixed-size pages (on the 32-bit microprocessors of the era, pages were typically 4096 bytes). A memory address is then split into two parts: the page number in the top 20 bits (for a 32-bit address with a 4 KB page size) and the offset within the page in the lower 12 bits.

The processor maintains a page table, indexed by virtual page number, that contains the physical page number and some other information, such as:

whether the page is dirty, i.e. has it been written to since it was last flushed to disk;

whether it is valid, i.e. is the page actually loaded into physical RAM at all;

some protection information, e.g. is the CPU allowed to write to the page (or read from it, for that matter).

Every time the processor wants to access the RAM, it first has to go to the page table (which is itself in RAM) to find the physical page for the virtual page in the address. Thus you double the number of memory accesses required just to read a word from memory.

There's the first performance hit. By turning on virtual memory, you double the number of RAM accesses required. Not only that, if your memory management unit was in a separate coprocessor (as with the 68020, whose MMU was the separate 68851 chip), there is a further performance hit just for the communication between the two.

On mainframes and minicomputers, this performance hit was mitigated with fancy hardware that did sophisticated caching - not so much on the 80386, 68020 and 68030.

A second problem: you needed a page table. If you wanted to allow processes to map the full 4 GB of virtual memory, your page table needed 2^20 entries of 32 bits each. That is 4 MB. If your computer has 4 MB of RAM, all of the physical RAM is being used just for the page table. This problem of the page table eating RAM is probably the reason behind both Microsoft's recommendation to keep the paging file at no more than twice the size of physical RAM and the similar restriction in System 7.

Note also that, if you want each process to have the illusion of having a unique address space, you actually need a page table per process, and the active page table has to be switched on each context switch.

Now let's talk about accessing pages. When the CPU wants to look at the contents of a virtual address, assuming it is not cached, it looks up the virtual page in the page table. If the page table says the virtual page is mapped to a physical page, a new address is constructed from the physical page and the offset within the page, and the data is fetched. If the virtual page is not mapped to a physical page, then depending on what kind of data is stored in the virtual page, one of several things happens.

In every case, the page is not there, so the process cannot continue. A trap called a page fault is generated, the OS suspends the process and tries to fix the problem by allocating a new physical page, and then:

if the page comes from the running program, the physical page is loaded from the program's executable file on disk.

if the page is newly allocated data, e.g. from a malloc request, the physical page is filled with zeroes;

if the page has previously been swapped out, the physical page is loaded from the paging file.

In all of these cases there is a performance hit: the process has to wait for either a page to be zeroed out or 4 KB of data to be loaded from disk. Obviously, in the latter case, if either the paging file or the program is fragmented, it's going to take longer.

What if there aren't any physical pages free to allocate to the process? In this case, a page needs to be thrown away. But which one to choose? The optimal method is to throw away a page you are never going to use again, or failing that, the page you will not need for the longest time. Unfortunately this requires clairvoyance, so the expensive mainframes of the 80386 and 68030 era used an algorithm called LRU, which stands for "least recently used" and gives quite a good approximation to the clairvoyant algorithm: you throw away the page that has not been used for the longest time. True LRU requires expensive hardware support, namely a time stamp for each physical page, updated every time the page is accessed. Microprocessors had to use less exact and less satisfactory approximations because they didn't have the time stamp support.

Some of the answers to this question suggest that System 7 always wrote pages to the paging file when they became dirty (i.e. the data in them was modified). This is important because pages that are not dirty are much cheaper to throw out, since you don't have to write them to the paging file first. However, that policy assumes every page will be thrown out at some point: if you have loads of RAM, you incur the cost of writing dirty pages to disk even though no page may ever need to be swapped out.

So, in summary, virtual memory is a performance hit because

each memory access actually needs two memory accesses

the page table (or tables) itself uses up RAM

if a virtual page is not allocated a physical page, you have to wait for the OS to allocate one

retrieving pages from disk and writing them to disk takes a long time

We still use virtual memory because

each process can appear to have the entire address space to itself which greatly simplifies programming

paging eliminates RAM fragmentation from the point of view of the operating system

you need less RAM than the sum of the memory requirements of all running processes.

Edit

With respect to the page table and CPU cache that traal mentioned, CPU cache does indeed mitigate the problem of multiple memory accesses somewhat, but the 68030 only had a 256-byte instruction cache and a 256-byte data cache.