To monitor and respond to events in real time, stream analytics have a soaring demand for high throughput and low latency. Central to meeting demand, even in a distributed system, is the performance of a single machine. This paper presents StreamBox, a novel stream processing engine that exploits the parallelism and memory hierarchies in modern multicore hardware. StreamBox executes a pipeline of transforms over records that may arrive out-of-order. For each transform, it groups records in ordered epochs based on watermark timestamps that guarantee no subsequent record timestamp will precede it. The key contribution of this work is the generalization of out-of-order record processing to out-of-order epoch processing per transform to produce abundant parallelism. We introduce a data structure called cascading containers that manages dependences and concurrency among multiple concurrent epochs in each transform and in the pipeline, maximizing available parallelism while minimizing synchronization overheads. StreamBox creates sequential memory layout of records based on temporal windows and steers record flows to optimize NUMA locality. On a 56-core machine, StreamBox processes up to 38M records per second (38 GB/s), which is comparable to a cluster of 100 – 200 CPU cores, while reducing the pipeline delay by 20× to 50 ms.