Linux Proc Filesystem
Contents
- 1. Introduction
- 2. Description by Category
- 3. Alphabetical Listing
- 3.1. proc/buddyinfo
- 3.2. proc/cgroups
- 3.3. proc/cmdline
- 3.4. proc/consoles
- 3.5. proc/cpuinfo
- 3.6. proc/crypto
- 3.7. proc/devices
- 3.8. proc/diskstats
- 3.9. proc/dma
- 3.10. proc/execdomains
- 3.11. proc/fb
- 3.12. proc/filesystems
- 3.13. proc/fs
- 3.14. proc/interrupts
- 3.15. proc/iomem
- 3.16. proc/ioports
- 3.17. proc/kallsyms
- 3.18. proc/meminfo
- 3.19. proc/<pid>
- 3.20. proc/self
- 3.21. oom_score
- 4. Kernel Implementation
- 5. References
1 Introduction
The proc virtual filesystem provides a channel for data exchange between user space and the kernel. It is normally mounted at /proc early during system initialization. Most files under /proc export kernel information as read-only, but some are writable; writes to them are generally used to control the features or behavior of the kernel and its modules.
3 Alphabetical Listing
3.1 proc/buddyinfo
buddyinfo exposes the state of the kernel's buddy allocator. Each zone row has 11 columns, giving the number of free blocks of order 0 through 10, i.e. blocks of 2^0 to 2^10 contiguous pages. Sample output from one system:
~$ cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32    645    775    517    199     83      7     20     93     50      8    294
Node 0, zone   Normal    143    100     14     14      5      1      3      2      1      2      0
Kernel code, mm/vmstat.c:

static const struct file_operations buddyinfo_file_operations = {
    .open       = fragmentation_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};
3.2 proc/cgroups
cgroups (Control Groups) is a Linux resource-management feature that organizes processes into hierarchical groups whose resource usage can be limited and monitored. The interface is exposed through the cgroupfs pseudo-filesystem; see man 7 cgroups for details.
~$ cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	2	1	1
cpu	3	1	1
cpuacct	3	1	1
memory	0	1	0
devices	4	76	1
freezer	5	1	1
net_cls	6	1	1
blkio	7	1	1
perf_event	8	1	1
net_prio	6	1	1
Excerpt from man 7 cgroups:
Control cgroups, usually referred to as cgroups, are a Linux kernel feature which allow processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored. The kernel's cgroup interface is provided through a pseudo-filesystem called cgroupfs. Grouping is implemented in the core cgroup kernel code, while resource tracking and limits are implemented in a set of per-resource-type subsystems (memory, CPU, and so on).
Terminology
A cgroup is a collection of processes that are bound to a set of limits or parameters defined via the cgroup filesystem.
A subsystem is a kernel component that modifies the behavior of the processes in a cgroup. Various subsystems have been implemented, making it possible to do things such as limiting the amount of CPU time and memory available to a cgroup, accounting for the CPU time used by a cgroup, and freezing and resuming execution of the processes in a cgroup. Subsystems are sometimes also known as resource controllers (or simply, controllers).
The cgroups for a controller are arranged in a hierarchy. This hierarchy is defined by creating, removing, and renaming subdirectories within the cgroup filesystem. At each level of the hierarchy, attributes (e.g., limits) can be defined. The limits, control, and accounting provided by cgroups generally have effect throughout the subhierarchy underneath the cgroup where the attributes are defined. Thus, for example, the limits placed on a cgroup at a higher level in the hierarchy cannot be exceeded by descendant cgroups.
Cgroups version 1 controllers
Each of the cgroups version 1 controllers is governed by a kernel configuration option (listed below). Additionally, the availability of the cgroups feature is governed by the CONFIG_CGROUPS kernel configuration option.
cpu (since Linux 2.6.24; CONFIG_CGROUP_SCHED) Cgroups can be guaranteed a minimum number of "CPU shares" when a system is busy. This does not limit a cgroup's CPU usage if the CPUs are not busy. For further information, see Documentation/scheduler/sched-design-CFS.txt.
In Linux 3.2, this controller was extended to provide CPU "bandwidth" control. If the kernel is configured with CONFIG_CFS_BANDWIDTH, then within each scheduling period (defined via a file in the cgroup directory), it is possible to define an upper limit on the CPU time allocated to the processes in a cgroup. This upper limit applies even if there is no other competition for the CPU. Further information can be found in the kernel source file Documentation/scheduler/sched-bwc.txt.
cpuacct (since Linux 2.6.24; CONFIG_CGROUP_CPUACCT) This provides accounting for CPU usage by groups of processes.
Further information can be found in the kernel source file Documentation/cgroup-v1/cpuacct.txt.
cpuset (since Linux 2.6.24; CONFIG_CPUSETS) This cgroup can be used to bind the processes in a cgroup to a specified set of CPUs and NUMA nodes.
Further information can be found in the kernel source file Documentation/cgroup-v1/cpusets.txt.
memory (since Linux 2.6.25; CONFIG_MEMCG) The memory controller supports reporting and limiting of process memory, kernel memory, and swap used by cgroups.
Further information can be found in the kernel source file Documentation/cgroup-v1/memory.txt.
devices (since Linux 2.6.26; CONFIG_CGROUP_DEVICE) This supports controlling which processes may create (mknod) devices as well as open them for reading or writing. The policies may be specified as whitelists and blacklists. Hierarchy is enforced, so new rules must not violate existing rules for the target or ancestor cgroups.
Further information can be found in the kernel source file Documentation/cgroup-v1/devices.txt.
freezer (since Linux 2.6.28; CONFIG_CGROUP_FREEZER) The freezer cgroup can suspend and restore (resume) all processes in a cgroup. Freezing a cgroup /A also causes its children, for example, processes in /A/B, to be frozen.
Further information can be found in the kernel source file Documentation/cgroup-v1/freezer-subsystem.txt.
net_cls (since Linux 2.6.29; CONFIG_CGROUP_NET_CLASSID) This places a classid, specified for the cgroup, on network packets created by a cgroup. These classids can then be used in firewall rules, as well as used to shape traffic using tc(8). This applies only to packets leaving the cgroup, not to traffic arriving at the cgroup.
Further information can be found in the kernel source file Documentation/cgroup-v1/net_cls.txt.
blkio (since Linux 2.6.33; CONFIG_BLK_CGROUP) The blkio cgroup controls and limits access to specified block devices by applying IO control in the form of throttling and upper limits against leaf nodes and intermediate nodes in the storage hierarchy.
Two policies are available. The first is a proportional-weight time-based division of disk implemented with CFQ. This is in effect for leaf nodes using CFQ. The second is a throttling policy which specifies upper I/O rate limits on a device.
Further information can be found in the kernel source file Documentation/cgroup-v1/blkio-controller.txt.
perf_event (since Linux 2.6.39; CONFIG_CGROUP_PERF) This controller allows perf monitoring of the set of processes grouped in a cgroup.
Further information can be found in the kernel source file tools/perf/Documentation/perf-record.txt.
net_prio (since Linux 3.3; CONFIG_CGROUP_NET_PRIO) This allows priorities to be specified, per network interface, for cgroups.
Further information can be found in the kernel source file Documentation/cgroup-v1/net_prio.txt.
hugetlb (since Linux 3.5; CONFIG_CGROUP_HUGETLB) This supports limiting the use of huge pages by cgroups.
Further information can be found in the kernel source file Documentation/cgroup-v1/hugetlb.txt.
pids (since Linux 4.3; CONFIG_CGROUP_PIDS) This controller permits limiting the number of processes that may be created in a cgroup (and its descendants).
Further information can be found in the kernel source file Documentation/cgroup-v1/pids.txt.
rdma (since Linux 4.11; CONFIG_CGROUP_RDMA) The RDMA controller permits limiting the use of RDMA/IB-specific resources per cgroup.
Further information can be found in the kernel source file Documentation/cgroup-v1/rdma.txt.
3.3 proc/cmdline
The kernel boot parameters:

~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.16.0-4-amd64 root=UUID=62c7ce2d-be7b-43d1-b9e1-6002ee63577d ro quiet
Kernel code, fs/proc/cmdline.c:

static int cmdline_proc_show(struct seq_file *m, void *v)
{
    seq_printf(m, "%s\n", saved_command_line);
    return 0;
}

static int cmdline_proc_open(struct inode *inode, struct file *file)
{
    return single_open(file, cmdline_proc_show, NULL);
}

static const struct file_operations cmdline_proc_fops = {
    .open       = cmdline_proc_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = single_release,
};

static int __init proc_cmdline_init(void)
{
    proc_create("cmdline", 0, NULL, &cmdline_proc_fops);
    return 0;
}
fs_initcall(proc_cmdline_init);
3.4 proc/consoles
consoles lists the consoles currently registered with the system.

Kernel code, fs/proc/consoles.c:

static const struct seq_operations consoles_op = {
    .start  = c_start,
    .next   = c_next,
    .stop   = c_stop,
    .show   = show_console_dev
};

static int consoles_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &consoles_op);
}

static const struct file_operations proc_consoles_operations = {
    .open       = consoles_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};

static int __init proc_consoles_init(void)
{
    proc_create("consoles", 0, NULL, &proc_consoles_operations);
    return 0;
}
fs_initcall(proc_consoles_init);
3.5 proc/cpuinfo
cpuinfo reports information about each logical CPU:

~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
stepping        : 3
microcode       : 0x1c
cpu MHz         : 3200.390
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs            :
bogomips        : 6384.87
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
stepping        : 3
...
The kernel code has an architecture-independent part (fs/proc/cpuinfo.c) and an architecture-specific part (here arch/x86/kernel/cpu/proc.c):

static int cpuinfo_open(struct inode *inode, struct file *file)
{
    arch_freq_prepare_all();
    return seq_open(file, &cpuinfo_op);
}

static const struct file_operations proc_cpuinfo_operations = {
    .open       = cpuinfo_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};

static int __init proc_cpuinfo_init(void)
{
    proc_create("cpuinfo", 0, NULL, &proc_cpuinfo_operations);
    return 0;
}
fs_initcall(proc_cpuinfo_init);
3.6 proc/crypto
crypto lists the cipher, hash, and other algorithms registered with the kernel crypto API. A partial example:

~$ cat /proc/crypto
name         : crct10dif
driver       : crct10dif-pclmul
module       : crct10dif_pclmul
priority     : 200
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 1
digestsize   : 2
...
name         : sha1
driver       : sha1-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 20
Kernel code, crypto/proc.c:

static const struct seq_operations crypto_seq_ops = {
    .start  = c_start,
    .next   = c_next,
    .stop   = c_stop,
    .show   = c_show
};

static int crypto_info_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &crypto_seq_ops);
}

static const struct file_operations proc_crypto_ops = {
    .open       = crypto_info_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release
};

void __init crypto_init_proc(void)
{
    proc_create("crypto", 0, NULL, &proc_crypto_ops);
}

void __exit crypto_exit_proc(void)
{
    remove_proc_entry("crypto", NULL);
}
3.7 proc/devices
devices lists the major numbers of the registered character and block devices:

~$ cat /proc/devices
Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  6 lp
  7 vcs
 10 misc
 13 input
 21 sg
 29 fb
 81 video4linux
 99 ppdev
116 alsa
128 ptm
136 pts
180 usb
189 usb_device
216 rfcomm
...
253 tpm
254 gpiochip

Block devices:
259 blkext
  8 sd
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd
 71 sd
128 sd
129 sd
130 sd
131 sd
Kernel code, fs/proc/devices.c:

static const struct seq_operations devinfo_ops = {
    .start = devinfo_start,
    .next  = devinfo_next,
    .stop  = devinfo_stop,
    .show  = devinfo_show
};

static int devinfo_open(struct inode *inode, struct file *filp)
{
    return seq_open(filp, &devinfo_ops);
}

static const struct file_operations proc_devinfo_operations = {
    .open       = devinfo_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};

static int __init proc_devices_init(void)
{
    proc_create("devices", 0, NULL, &proc_devinfo_operations);
    return 0;
}
fs_initcall(proc_devices_init);

static int devinfo_show(struct seq_file *f, void *v)
{
    int i = *(loff_t *) v;

    if (i < CHRDEV_MAJOR_MAX) {
        if (i == 0)
            seq_puts(f, "Character devices:\n");
        chrdev_show(f, i);
    }
#ifdef CONFIG_BLOCK
    else {
        i -= CHRDEV_MAJOR_MAX;
        if (i == 0)
            seq_puts(f, "\nBlock devices:\n");
        blkdev_show(f, i);
    }
#endif
    return 0;
}

void chrdev_show(struct seq_file *f, off_t offset)
{
    struct char_device_struct *cd;

    mutex_lock(&chrdevs_lock);
    for (cd = chrdevs[major_to_index(offset)]; cd; cd = cd->next) {
        if (cd->major == offset)
            seq_printf(f, "%3d %s\n", cd->major, cd->name);
    }
    mutex_unlock(&chrdevs_lock);
}

void blkdev_show(struct seq_file *seqf, off_t offset)
{
    struct blk_major_name *dp;

    mutex_lock(&block_class_lock);
    for (dp = major_names[major_to_index(offset)]; dp; dp = dp->next)
        if (dp->major == offset)
            seq_printf(seqf, "%3d %s\n", dp->major, dp->name);
    mutex_unlock(&block_class_lock);
}
3.8 proc/diskstats
diskstats reports I/O statistics for block devices. An example follows; the fields are documented in Documentation/iostats.txt:

~$ cat /proc/diskstats
8 16 sdb 476 0 37508 15096 0 0 0 0 0 2608 15096
8 17 sdb1 50 0 4160 1780 0 0 0 0 0 1780 1780
8 18 sdb2 48 0 4144 1400 0 0 0 0 0 1380 1400
8 19 sdb3 48 0 4144 1952 0 0 0 0 0 1884 1952
8 20 sdb4 2 0 4 176 0 0 0 0 0 176 176
8 21 sdb5 46 0 4128 2588 0 0 0 0 0 1640 2588
8 22 sdb6 46 0 4128 1052 0 0 0 0 0 1004 1052
8 23 sdb7 48 0 4144 984 0 0 0 0 0 980 984
8 24 sdb8 50 0 4160 2536 0 0 0 0 0 2016 2536
8 25 sdb9 48 0 4144 1188 0 0 0 0 0 1188 1188
8 0 sda 28874 1231 1655786 10164 6376 11927 340456 6204 0 5348 16352
8 1 sda1 52 0 4176 24 0 0 0 0 0 16 24
8 2 sda2 28795 1231 1649538 10136 6225 11927 340456 6068 0 5216 16188
The columns are:

 1 - major number
 2 - minor number
 3 - device name
 4 - reads completed successfully
 5 - reads merged
 6 - sectors read
 7 - time spent reading (ms)
 8 - writes completed
 9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)

Columns 4-14 are described as follows (the excerpt below numbers them 1-11, counting from the first field after the device name):
Field 1 – # of reads completed
This is the total number of reads completed successfully.
Field 2 – # of reads merged, field 6 – # of writes merged
Reads and writes which are adjacent to each other may be merged for
efficiency. Thus two 4K reads may become one 8K read before it is
ultimately handed to the disk, and so it will be counted (and queued)
as only one I/O. This field lets you know how often this was done.
Field 3 – # of sectors read
This is the total number of sectors read successfully.
Field 4 – # of milliseconds spent reading
This is the total number of milliseconds spent by all reads (as
measured from __make_request() to end_that_request_last()).
Field 5 – # of writes completed
This is the total number of writes completed successfully.
Field 6 – # of writes merged
See the description of field 2.
Field 7 – # of sectors written
This is the total number of sectors written successfully.
Field 8 – # of milliseconds spent writing
This is the total number of milliseconds spent by all writes (as
measured from __make_request() to end_that_request_last()).
Field 9 – # of I/Os currently in progress
The only field that should go to zero. Incremented as requests are
given to appropriate struct request_queue and decremented as they finish.
Field 10 – # of milliseconds spent doing I/Os
This field increases so long as field 9 is nonzero.
Field 11 – weighted # of milliseconds spent doing I/Os
This field is incremented at each I/O start, I/O completion, I/O
merge, or read of these stats by the number of I/Os in progress
(field 9) times the number of milliseconds spent doing I/O since the
last update of this field. This can provide an easy measure of both
I/O completion time and the backlog that may be accumulating.
Kernel code, block/genhd.c:

static const struct seq_operations diskstats_op = {
    .start = disk_seqf_start,
    .next  = disk_seqf_next,
    .stop  = disk_seqf_stop,
    .show  = diskstats_show
};

static int diskstats_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &diskstats_op);
}

static const struct file_operations proc_diskstats_operations = {
    .open       = diskstats_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};

static int __init proc_genhd_init(void)
{
    proc_create("diskstats", 0, NULL, &proc_diskstats_operations);
    proc_create("partitions", 0, NULL, &proc_partitions_operations);
    return 0;
}
module_init(proc_genhd_init);

static int diskstats_show(struct seq_file *seqf, void *v)
{
    struct gendisk *gp = v;
    struct disk_part_iter piter;
    struct hd_struct *hd;
    char buf[BDEVNAME_SIZE];
    unsigned int inflight[2];
    int cpu;

    /*
    if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
        seq_puts(seqf, "major minor name"
                 " rio rmerge rsect ruse wio wmerge "
                 "wsect wuse running use aveq"
                 "\n\n");
    */

    disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
    while ((hd = disk_part_iter_next(&piter))) {
        cpu = part_stat_lock();
        part_round_stats(gp->queue, cpu, hd);
        part_stat_unlock();
        part_in_flight(gp->queue, hd, inflight);
        seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
                   "%u %lu %lu %lu %u %u %u %u\n",
                   MAJOR(part_devt(hd)), MINOR(part_devt(hd)),
                   disk_name(gp, hd->partno, buf),
                   part_stat_read(hd, ios[READ]),
                   part_stat_read(hd, merges[READ]),
                   part_stat_read(hd, sectors[READ]),
                   jiffies_to_msecs(part_stat_read(hd, ticks[READ])),
                   part_stat_read(hd, ios[WRITE]),
                   part_stat_read(hd, merges[WRITE]),
                   part_stat_read(hd, sectors[WRITE]),
                   jiffies_to_msecs(part_stat_read(hd, ticks[WRITE])),
                   inflight[0],
                   jiffies_to_msecs(part_stat_read(hd, io_ticks)),
                   jiffies_to_msecs(part_stat_read(hd, time_in_queue))
            );
    }
    disk_part_iter_exit(&piter);

    return 0;
}
3.9 proc/dma
This is a list of the registered ISA DMA (direct memory access) channels in use.
~$ cat /proc/dma
 4: cascade

Kernel code, kernel/dma.c:

static int proc_dma_open(struct inode *inode, struct file *file)
{
    return single_open(file, proc_dma_show, NULL);
}

static const struct file_operations proc_dma_operations = {
    .open       = proc_dma_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = single_release,
};

static int __init proc_dma_init(void)
{
    proc_create("dma", 0, NULL, &proc_dma_operations);
    return 0;
}
__initcall(proc_dma_init);

#ifdef MAX_DMA_CHANNELS
static int proc_dma_show(struct seq_file *m, void *v)
{
    int i;

    for (i = 0 ; i < MAX_DMA_CHANNELS ; i++) {
        if (dma_chan_busy[i].lock) {
            seq_printf(m, "%2d: %s\n", i, dma_chan_busy[i].device_id);
        }
    }
    return 0;
}
#else
static int proc_dma_show(struct seq_file *m, void *v)
{
    seq_puts(m, "No DMA\n");
    return 0;
}
#endif /* MAX_DMA_CHANNELS */
3.10 proc/execdomains
execdomains lists the supported execution domains (ABIs). Current kernels always show Linux [kernel]:

cat /proc/execdomains
0-0     Linux   [kernel]

Kernel code, kernel/exec_domain.c:

#ifdef CONFIG_PROC_FS
static int execdomains_proc_show(struct seq_file *m, void *v)
{
    seq_puts(m, "0-0\tLinux \t[kernel]\n");
    return 0;
}

static int execdomains_proc_open(struct inode *inode, struct file *file)
{
    return single_open(file, execdomains_proc_show, NULL);
}

static const struct file_operations execdomains_proc_fops = {
    .open       = execdomains_proc_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = single_release,
};

static int __init proc_execdomains_init(void)
{
    proc_create("execdomains", 0, NULL, &execdomains_proc_fops);
    return 0;
}
module_init(proc_execdomains_init);
#endif

SYSCALL_DEFINE1(personality, unsigned int, personality)
{
    unsigned int old = current->personality;

    if (personality != 0xffffffff)
        set_personality(personality);

    return old;
}
3.11 proc/fb
fb (frame buffer) lists the frame buffer devices registered in the system:

~$ cat /proc/fb
0 inteldrmfb

Kernel code, drivers/video/fbdev/core/fbmem.c:

static int fb_seq_show(struct seq_file *m, void *v)
{
    int i = *(loff_t *)v;
    struct fb_info *fi = registered_fb[i];

    if (fi)
        seq_printf(m, "%d %s\n", fi->node, fi->fix.id);
    return 0;
}

static const struct seq_operations proc_fb_seq_ops = {
    .start  = fb_seq_start,
    .next   = fb_seq_next,
    .stop   = fb_seq_stop,
    .show   = fb_seq_show,
};

static int proc_fb_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &proc_fb_seq_ops);
}

static const struct file_operations fb_proc_fops = {
    .owner      = THIS_MODULE,
    .open       = proc_fb_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};
3.12 proc/filesystems
filesystems lists the filesystem types the kernel currently supports; nodev marks types that do not require a backing block device:

~$ cat /proc/filesystems
nodev	sysfs
nodev	rootfs
nodev	ramfs
nodev	bdev
nodev	proc
nodev	cpuset
nodev	cgroup
nodev	cgroup2
nodev	tmpfs
nodev	devtmpfs
nodev	debugfs
nodev	tracefs
nodev	securityfs
nodev	sockfs
nodev	bpf
nodev	pipefs
nodev	hugetlbfs
nodev	devpts
nodev	pstore
nodev	mqueue
	ext3
	ext2
	ext4
nodev	autofs
nodev	binfmt_misc
	fuseblk
nodev	fuse
nodev	fusectl
Kernel code, fs/filesystems.c:

static int filesystems_proc_show(struct seq_file *m, void *v)
{
    struct file_system_type * tmp;

    read_lock(&file_systems_lock);
    tmp = file_systems;
    while (tmp) {
        seq_printf(m, "%s\t%s\n",
                   (tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
                   tmp->name);
        tmp = tmp->next;
    }
    read_unlock(&file_systems_lock);
    return 0;
}

static int filesystems_proc_open(struct inode *inode, struct file *file)
{
    return single_open(file, filesystems_proc_show, NULL);
}

static const struct file_operations filesystems_proc_fops = {
    .open       = filesystems_proc_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = single_release,
};

static int __init proc_filesystems_init(void)
{
    proc_create("filesystems", 0, NULL, &filesystems_proc_fops);
    return 0;
}
module_init(proc_filesystems_init);
3.13 proc/fs
The fs directory contains one subdirectory per filesystem type currently in use (for example ext4). Under each type there is a second-level subdirectory for every mounted block device (for example sda2), whose files expose that mount's attributes.
proc/fs itself is created by fs/proc/root.c:

void __init proc_root_init(void)
{
    int err;

    proc_init_inodecache();
    set_proc_pid_nlink();
    err = register_filesystem(&proc_fs_type);
    if (err)
        return;

    proc_self_init();
    proc_thread_self_init();
    proc_symlink("mounts", NULL, "self/mounts");
    proc_net_init();

#ifdef CONFIG_SYSVIPC
    proc_mkdir("sysvipc", NULL);
#endif
    proc_mkdir("fs", NULL);
    proc_mkdir("driver", NULL);
    proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
    /* just give it a mountpoint */
    proc_create_mount_point("openprom");
#endif
    proc_tty_init();
    proc_mkdir("bus", NULL);
    proc_sys_init();
}
The per-type subdirectories are created by the individual filesystems; for ext4, see fs/ext4/sysfs.c:

static const char proc_dirname[] = "fs/ext4";
static struct proc_dir_entry *ext4_proc_root;

int __init ext4_init_sysfs(void)
{
    int ret;

    kobject_set_name(&ext4_kset.kobj, "ext4");
    ext4_kset.kobj.parent = fs_kobj;
    ret = kset_register(&ext4_kset);
    if (ret)
        return ret;

    ret = kobject_init_and_add(&ext4_feat, &ext4_feat_ktype,
                               NULL, "features");
    if (ret)
        kset_unregister(&ext4_kset);
    else
        ext4_proc_root = proc_mkdir(proc_dirname, NULL);
    return ret;
}

static const struct ext4_proc_files {
    const char *name;
    const struct file_operations *fops;
} proc_files[] = {
    PROC_FILE_LIST(options),
    PROC_FILE_LIST(es_shrinker_info),
    PROC_FILE_LIST(mb_groups),
    { NULL, NULL },
};

int ext4_register_sysfs(struct super_block *sb)
{
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    const struct ext4_proc_files *p;
    int err;

    sbi->s_kobj.kset = &ext4_kset;
    init_completion(&sbi->s_kobj_unregister);
    err = kobject_init_and_add(&sbi->s_kobj, &ext4_sb_ktype, NULL,
                               "%s", sb->s_id);
    if (err)
        return err;

    if (ext4_proc_root)
        sbi->s_proc = proc_mkdir(sb->s_id, ext4_proc_root);
    if (sbi->s_proc) {
        for (p = proc_files; p->name; p++)
            proc_create_data(p->name, S_IRUGO, sbi->s_proc,
                             p->fops, sb);
    }
    return 0;
}
3.14 proc/interrupts
interrupts shows, for every interrupt source, the number of interrupts delivered to each CPU; it is useful for checking whether interrupt load is balanced across CPUs:
~$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
   0:         18          0          0          0   IR-IO-APIC    2-edge      timer
   1:          3          1          1          8   IR-IO-APIC    1-edge      i8042
   8:          1          0          0          0   IR-IO-APIC    8-edge      rtc0
   9:       1076       2273        812        836   IR-IO-APIC    9-fasteoi   acpi
  12:         42        475         34         51   IR-IO-APIC   12-edge      i8042
  16:          2         23          2          2   IR-IO-APIC   16-fasteoi   ehci_hcd:usb1
  18:          0          0          1          0   IR-IO-APIC   18-fasteoi   i801_smbus
  23:          3         28          0          2   IR-IO-APIC   23-fasteoi   ehci_hcd:usb2
  24:          0          0          0          0   DMAR-MSI      0-edge      dmar0
  25:          0          0          0          0   DMAR-MSI      1-edge      dmar1
  28:         17          1          3          1   IR-PCI-MSI 1572864-edge   rtsx_pci
  29:        160       1281        141        110   IR-PCI-MSI 409600-edge    enp0s25
  30:       7815      43472       5581       5237   IR-PCI-MSI 327680-edge    xhci_hcd
  31:      29423      41453      16894      16308   IR-PCI-MSI 512000-edge    ahci[0000:00:1f.2]
  32:         21          0          0          0   IR-PCI-MSI 360448-edge    mei_me
  33:        234        129         18          5   IR-PCI-MSI 442368-edge    snd_hda_intel:card1
  34:         17         25          9          6   IR-PCI-MSI 1048576-edge
  35:      21776      66376      16760      12078   IR-PCI-MSI 32768-edge     i915
  36:       7089      20303       4455       3968   IR-PCI-MSI 2097152-edge   iwlwifi
  37:        116        855        136        140   IR-PCI-MSI 49152-edge     snd_hda_intel:card0
 NMI:          0          0          0          0   Non-maskable interrupts
 LOC:     310517     250892     326730     272651   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:          0          0          0          0   Performance monitoring interrupts
 IWI:          0          0          2          0   IRQ work interrupts
 RTR:          0          0          0          0   APIC ICR read retries
 RES:      32446      33569      28799      18822   Rescheduling interrupts
 CAL:      50777      60493      62390      53925   Function call interrupts
 TLB:      48661      58330      60778      51797   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:         12         12         12         12   Machine check polls
 ERR:          0
 MIS:          0
 PIN:          0          0          0          0   Posted-interrupt notification event
 PIW:          0          0          0          0   Posted-interrupt wakeup event
Kernel code, fs/proc/interrupts.c and kernel/irq/proc.c:

static const struct seq_operations int_seq_ops = {
    .start = int_seq_start,
    .next  = int_seq_next,
    .stop  = int_seq_stop,
    .show  = show_interrupts
};

static int interrupts_open(struct inode *inode, struct file *filp)
{
    return seq_open(filp, &int_seq_ops);
}

static const struct file_operations proc_interrupts_operations = {
    .open       = interrupts_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};

static int __init proc_interrupts_init(void)
{
    proc_create("interrupts", 0, NULL, &proc_interrupts_operations);
    return 0;
}
fs_initcall(proc_interrupts_init);

int show_interrupts(struct seq_file *p, void *v)
{
    static int prec;

    unsigned long flags, any_count = 0;
    int i = *(loff_t *) v, j;
    struct irqaction *action;
    struct irq_desc *desc;

    if (i > ACTUAL_NR_IRQS)
        return 0;

    if (i == ACTUAL_NR_IRQS)
        return arch_show_interrupts(p, prec);

    /* print header and calculate the width of the first column */
    if (i == 0) {
        for (prec = 3, j = 1000; prec < 10 && j <= nr_irqs; ++prec)
            j *= 10;

        seq_printf(p, "%*s", prec + 8, "");
        for_each_online_cpu(j)
            seq_printf(p, "CPU%-8d", j);
        seq_putc(p, '\n');
    }

    irq_lock_sparse();
    desc = irq_to_desc(i);
    if (!desc)
        goto outsparse;

    raw_spin_lock_irqsave(&desc->lock, flags);
    for_each_online_cpu(j)
        any_count |= kstat_irqs_cpu(i, j);
    action = desc->action;
    if ((!action || irq_desc_is_chained(desc)) && !any_count)
        goto out;

    seq_printf(p, "%*d: ", prec, i);
    for_each_online_cpu(j)
        seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));

    if (desc->irq_data.chip) {
        if (desc->irq_data.chip->irq_print_chip)
            desc->irq_data.chip->irq_print_chip(&desc->irq_data, p);
        else if (desc->irq_data.chip->name)
            seq_printf(p, " %8s", desc->irq_data.chip->name);
        else
            seq_printf(p, " %8s", "-");
    } else {
        seq_printf(p, " %8s", "None");
    }
    if (desc->irq_data.domain)
        seq_printf(p, " %*d", prec, (int) desc->irq_data.hwirq);
    else
        seq_printf(p, " %*s", prec, "");
#ifdef CONFIG_GENERIC_IRQ_SHOW_LEVEL
    seq_printf(p, " %-8s", irqd_is_level_type(&desc->irq_data) ? "Level" : "Edge");
#endif
    if (desc->name)
        seq_printf(p, "-%-8s", desc->name);

    if (action) {
        seq_printf(p, " %s", action->name);
        while ((action = action->next) != NULL)
            seq_printf(p, ", %s", action->name);
    }

    seq_putc(p, '\n');
out:
    raw_spin_unlock_irqrestore(&desc->lock, flags);
outsparse:
    irq_unlock_sparse();
    return 0;
}
3.15 proc/iomem
The map of I/O memory: reserved regions, BIOS areas, graphics memory, system RAM, PCI address space, and so on. Some PCIe regions are not exposed to the OS by the BIOS for security reasons and appear as reserved.
~$ cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c7fff : Video ROM
000c8000-000cbfff : pnp 00:00
000cc000-000cffff : pnp 00:00
000d0000-000d3fff : pnp 00:00
000d4000-000d7fff : pnp 00:00
000d8000-000dbfff : pnp 00:00
000dc000-000dffff : pnp 00:00
000e0000-000fffff : reserved
  000f0000-000fffff : System ROM
00100000-1fffffff : System RAM
  01000000-01519c00 : Kernel code
  01519c01-018ecdff : Kernel data
  01a21000-01af2fff : Kernel bss
20000000-201fffff : reserved
20200000-40003fff : System RAM
40004000-40004fff : reserved
40005000-cdba6fff : System RAM
cdba7000-dae9efff : reserved
dae9f000-daf9efff : ACPI Non-volatile Storage
daf9f000-daffefff : ACPI Tables
dafff000-df9fffff : reserved
  dba00000-df9fffff : Graphics Stolen Memory
dfa00000-febfffff : PCI Bus 0000:00
  e0000000-efffffff : 0000:00:02.0
  f0000000-f03fffff : 0000:00:02.0
  f0400000-f0bfffff : PCI Bus 0000:02
  f0c00000-f13fffff : PCI Bus 0000:04
  f1400000-f1bfffff : PCI Bus 0000:04
  f1c00000-f1cfffff : PCI Bus 0000:03
    f1c00000-f1c01fff : 0000:03:00.0
      f1c00000-f1c01fff : iwlwifi
  f1d00000-f24fffff : PCI Bus 0000:02
    f1d00000-f1d000ff : 0000:02:00.0
      f1d00000-f1d000ff : mmc0
  f2500000-f251ffff : 0000:00:19.0
    f2500000-f251ffff : e1000e
  f2520000-f252ffff : 0000:00:14.0
    f2520000-f252ffff : xhci_hcd
  f2530000-f2533fff : 0000:00:1b.0
    f2530000-f2533fff : ICH HD audio
  f2534000-f25340ff : 0000:00:1f.3
  f2535000-f253500f : 0000:00:16.0
    f2535000-f253500f : mei_me
  f2538000-f25387ff : 0000:00:1f.2
    f2538000-f25387ff : ahci
  f2539000-f25393ff : 0000:00:1d.0
    f2539000-f25393ff : ehci_hcd
  f253a000-f253a3ff : 0000:00:1a.0
    f253a000-f253a3ff : ehci_hcd
  f253b000-f253bfff : 0000:00:19.0
    f253b000-f253bfff : e1000e
  f253c000-f253cfff : 0000:00:16.3
f8000000-fbffffff : PCI MMCONFIG 0000 [bus 00-3f]
  f8000000-fbffffff : reserved
    f8000000-fbffffff : pnp 00:01
fec00000-fec00fff : reserved
  fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
  fed00000-fed003ff : PNP0103:00
fed08000-fed08fff : reserved
fed10000-fed19fff : reserved
  fed10000-fed17fff : pnp 00:01
  fed18000-fed18fff : pnp 00:01
  fed19000-fed19fff : pnp 00:01
fed1c000-fed1ffff : reserved
  fed1c000-fed1ffff : pnp 00:01
    fed1f410-fed1f414 : iTCO_wdt
      fed1f410-fed1f414 : iTCO_wdt
fed40000-fed4bfff : PCI Bus 0000:00
  fed45000-fed4bfff : pnp 00:01
fed90000-fed90fff : dmar0
fed91000-fed91fff : dmar1
fee00000-fee00fff : Local APIC
  fee00000-fee00fff : reserved
ffc00000-ffffffff : reserved
  fffff000-ffffffff : pnp 00:01
100000000-41e5fffff : System RAM
41e600000-41effffff : reserved
41f000000-41fffffff : RAM buffer
Kernel code, kernel/resource.c:

#ifdef CONFIG_PROC_FS

enum { MAX_IORES_LEVEL = 5 };

static void *r_start(struct seq_file *m, loff_t *pos)
    __acquires(resource_lock)
{
    struct resource *p = m->private;
    loff_t l = 0;
    read_lock(&resource_lock);
    for (p = p->child; p && l < *pos; p = r_next(m, p, &l))
        ;
    return p;
}

static void r_stop(struct seq_file *m, void *v)
    __releases(resource_lock)
{
    read_unlock(&resource_lock);
}

static int r_show(struct seq_file *m, void *v)
{
    struct resource *root = m->private;
    struct resource *r = v, *p;
    unsigned long long start, end;
    int width = root->end < 0x10000 ? 4 : 8;
    int depth;

    for (depth = 0, p = r; depth < MAX_IORES_LEVEL; depth++, p = p->parent)
        if (p->parent == root)
            break;

    if (file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) {
        start = r->start;
        end = r->end;
    } else {
        start = end = 0;
    }

    seq_printf(m, "%*s%0*llx-%0*llx : %s\n",
               depth * 2, "",
               width, start,
               width, end,
               r->name ? r->name : "<BAD>");
    return 0;
}

static const struct seq_operations resource_op = {
    .start  = r_start,
    .next   = r_next,
    .stop   = r_stop,
    .show   = r_show,
};

static int ioports_open(struct inode *inode, struct file *file)
{
    int res = seq_open(file, &resource_op);
    if (!res) {
        struct seq_file *m = file->private_data;
        m->private = &ioport_resource;
    }
    return res;
}

static int iomem_open(struct inode *inode, struct file *file)
{
    int res = seq_open(file, &resource_op);
    if (!res) {
        struct seq_file *m = file->private_data;
        m->private = &iomem_resource;
    }
    return res;
}

static const struct file_operations proc_ioports_operations = {
    .open       = ioports_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};

static const struct file_operations proc_iomem_operations = {
    .open       = iomem_open,
    .read       = seq_read,
    .llseek     = seq_lseek,
    .release    = seq_release,
};

static int __init ioresources_init(void)
{
    proc_create("ioports", 0, NULL, &proc_ioports_operations);
    proc_create("iomem", 0, NULL, &proc_iomem_operations);
    return 0;
}
__initcall(ioresources_init);

#endif /* CONFIG_PROC_FS */
3.16 proc/ioports
Registered I/O port ranges, accessed with inb/outb:
~$ cat /proc/ioports
0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0061-0061 : PNP0800:00
  0062-0062 : PNP0C09:00
    0062-0062 : EC data
  0064-0064 : keyboard
  0066-0066 : PNP0C09:00
    0066-0066 : EC cmd
  0070-0071 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu
    00f0-00f0 : PNP0C04:00
  03c0-03df : vga+
  0400-0403 : ACPI PM1a_EVT_BLK
  0404-0405 : ACPI PM1a_CNT_BLK
  0408-040b : ACPI PM_TMR
  0410-0415 : ACPI CPU throttle
  0420-042f : ACPI GPE0_BLK
  0430-0433 : iTCO_wdt
    0430-0433 : iTCO_wdt
  0450-0450 : ACPI PM2_CNT_BLK
  0460-047f : iTCO_wdt
    0460-047f : iTCO_wdt
  0500-057f : pnp 00:01
  0800-080f : pnp 00:01
  0cf8-0cff : PCI conf1
0d00-ffff : PCI Bus 0000:00
  15e0-15ef : pnp 00:01
  1600-167f : pnp 00:01
  3000-3fff : PCI Bus 0000:04
  4000-4fff : PCI Bus 0000:02
  5000-503f : 0000:00:02.0
  5060-507f : 0000:00:1f.2
    5060-507f : ahci
  5080-509f : 0000:00:19.0
  50a0-50a7 : 0000:00:1f.2
    50a0-50a7 : ahci
  50a8-50af : 0000:00:1f.2
    50a8-50af : ahci
  50b0-50b7 : 0000:00:16.3
    50b0-50b7 : serial
  50b8-50bb : 0000:00:1f.2
    50b8-50bb : ahci
  50bc-50bf : 0000:00:1f.2
    50bc-50bf : ahci
  efa0-efbf : 0000:00:1f.3
    efa0-efbf : i801_smbus
Kernel code: kernel/resource.c; see the iomem listing above.
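Given the `r_show()` output format above (`%*s%0*llx-%0*llx : %s`, with two spaces of indentation per nesting level), a userspace reader can recover the resource tree. A minimal sketch; the helper name `parse_resource_line` is made up for illustration:

```c
#include <stdio.h>

/* Parse one /proc/ioports (or /proc/iomem) line of the form
 * "start-end : name".  r_show() emits two leading spaces per
 * nesting level, so the indentation encodes the tree depth.
 * Hypothetical helper; returns 1 on success. */
static int parse_resource_line(const char *line, unsigned long *start,
                               unsigned long *end, int *depth)
{
    int i = 0;

    while (line[i] == ' ')          /* count leading spaces */
        i++;
    *depth = i / 2;                 /* two spaces per level */
    return sscanf(line + i, "%lx-%lx", start, end) == 2;
}
```

With the sample output above, `"  5060-507f : ahci"` parses to the range 0x5060-0x507f at depth 1 under its PCI parent.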
3.17 proc/kallsyms
Symbol definitions exported by the kernel, used by the module tools for linking and symbol resolution:
~$ cat /proc/kallsyms
0000000000000000 A irq_stack_union
0000000000000000 A __per_cpu_start
ffffffff810002b8 T _stext
ffffffff81001000 T hypercall_page
ffffffff81001000 T xen_hypercall_set_trap_table
ffffffff81001020 T xen_hypercall_mmu_update
ffffffff81001040 T xen_hypercall_set_gdt
ffffffff81001060 T xen_hypercall_stack_switch
ffffffff81001080 T xen_hypercall_set_callbacks
ffffffff810010a0 T xen_hypercall_fpu_taskswitch
ffffffff810010c0 T xen_hypercall_sched_op_compat
ffffffff810010e0 T xen_hypercall_platform_op
ffffffff81001100 T xen_hypercall_set_debugreg
ffffffff81001120 T xen_hypercall_get_debugreg
ffffffff81001140 T xen_hypercall_update_descriptor
ffffffff81001160 T xen_hypercall_ni
ffffffff81001180 T xen_hypercall_memory_op
...
ffffffff8113fb60 T find_get_entries
ffffffff8113fc90 T find_get_pages
ffffffff8113fdd0 T mempool_kfree
ffffffff8113fde0 T mempool_alloc_slab
ffffffff8113fe00 T mempool_free_slab
ffffffff8113fe20 T mempool_alloc_pages
ffffffff8113fe30 T mempool_free_pages
ffffffff8113fe40 t remove_element.isra.1
ffffffff8113fe60 T mempool_destroy
ffffffff8113fec0 T mempool_alloc
ffffffff81140010 t add_element
ffffffff81140030 T mempool_free
ffffffff811400c0 T mempool_create_node
ffffffff81140200 T mempool_create
ffffffff81140220 T mempool_resize
ffffffff811403c0 T mempool_kmalloc
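Each kallsyms line has the shape `address type name [module]`. A small parser sketch (the function name is illustrative, not part of any kernel API):

```c
#include <stdio.h>
#include <string.h>

/* Split one /proc/kallsyms line into address, symbol type and name.
 * Note: without sufficient privilege (or with kptr_restrict set) the
 * kernel may report all addresses as zero.  Illustrative helper. */
static int parse_kallsyms_line(const char *line, unsigned long long *addr,
                               char *type, char *name, size_t name_len)
{
    char buf[128];

    if (sscanf(line, "%llx %c %127s", addr, type, buf) != 3)
        return 0;
    strncpy(name, buf, name_len - 1);
    name[name_len - 1] = '\0';
    return 1;
}
```

Applied to the sample line `ffffffff8113fb60 T find_get_entries`, it yields the address, the symbol class `T` (global text), and the symbol name.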
3.18 proc/meminfo
~$ cat /proc/meminfo
MemTotal:       16145220 kB
MemFree:        14074612 kB
MemAvailable:   14689140 kB
Buffers:           88384 kB
Cached:           878512 kB
SwapCached:            0 kB
Active:          1272364 kB
Inactive:         597024 kB
Active(anon):     903844 kB
Inactive(anon):   217676 kB
Active(file):     368520 kB
Inactive(file):   379348 kB
Unevictable:         120 kB
Mlocked:             120 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               100 kB
Writeback:             0 kB
AnonPages:        902604 kB
...
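meminfo fields follow a simple `Name: value kB` layout, so a single field can be pulled out with a line-by-line scan. A sketch; the helper name `meminfo_field_kb` is an assumption for illustration:

```c
#include <stdio.h>
#include <string.h>

/* Find "field:" at the start of a line in meminfo-style text and
 * return its numeric value (in kB), or -1 if the field is absent.
 * Sketch only: assumes the "Name: value kB" layout shown above. */
static long meminfo_field_kb(const char *text, const char *field)
{
    const char *p = text;
    size_t flen = strlen(field);
    long val;

    while (p && *p) {
        if (strncmp(p, field, flen) == 0 && p[flen] == ':'
            && sscanf(p + flen + 1, "%ld", &val) == 1)
            return val;
        p = strchr(p, '\n');        /* advance to next line */
        if (p)
            p++;
    }
    return -1;
}
```

In a real program the text would come from reading /proc/meminfo; the parsing itself does not depend on where the buffer came from.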
3.19 proc/<pid>
3.19.1 proc/<pid>/oom_adj
Adjusts the process's affinity for OOM killing. The valid range is [-17, +15]; -17 is a special value that disables OOM killing for the process. The higher the value, the more likely the process is to be selected when OOM occurs. The default is 0, and CAP_SYS_RESOURCE is required to modify this value. Since Linux 2.6.36 this interface is deprecated; use /proc/<pid>/oom_score_adj instead.
3.19.2 proc/<pid>/oom_score
Shows the process's current OOM-killer score. The higher the score, the more likely the process is to be selected by the OOM killer. The base score is derived from the process's memory usage, adjusted for fork count, CPU time, nice value, privilege, and direct hardware access:
This file displays the current score that the kernel gives to this process for the purpose of selecting a process for the OOM-killer. A higher score means that the process is more likely to be selected by the OOM-killer. The basis for this score is the amount of memory used by the process, with increases (+) or decreases (-) for factors including:
- whether the process creates a lot of children using fork(2) (+);
- whether the process has been running a long time, or has used a lot of CPU time (-);
- whether the process has a low nice value (i.e., > 0) (+);
- whether the process is privileged (-); and
- whether the process is making direct hardware access (-).
The oom_score also reflects the adjustment specified by the oom_score_adj or oom_adj setting for the process.
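The current score can be read back from proc at any time. A minimal reader sketch (the function name `read_oom_score` is made up for illustration):

```c
#include <stdio.h>

/* Read a process's current badness score from /proc/<pid>/oom_score.
 * Returns the score, or -1 on error.  Illustrative sketch. */
static long read_oom_score(const char *path)
{
    FILE *fp = fopen(path, "r");
    long score;

    if (!fp)
        return -1;
    if (fscanf(fp, "%ld", &score) != 1)
        score = -1;
    fclose(fp);
    return score;
}
```

Calling it with `"/proc/self/oom_score"` returns the caller's own score; memory-hungry processes report larger values.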
3.19.3 proc/<pid>/oom_score_adj
Adjusts the badness score the OOM killer uses when selecting this process. The valid range is [-1000, +1000]; -1000 disables selection entirely. Important resident daemons can set -1000 to prevent being killed by the OOM killer:
#include <errno.h>
#include <stdio.h>

static int disable_oom()
{
    FILE *fp = fopen("/proc/self/oom_score_adj", "w");
    if (!fp) {
        fprintf(stderr, "open oom_score_adj failed\n");
        return -1;
    }
    fprintf(fp, "%i", -1000);
    fclose(fp);
    return 0;
}

int main()
{
    int ret;

    ret = disable_oom();
    if (0 == ret)
        printf("disable oom success\n");

    // do post work
    ...

    return 0;
}
disable oom success
3.19.4 proc/<pid>/stack
A symbolic trace of the process's kernel-mode call stack.
~$ cat /proc/self/stack
[<ffffffff810695d9>] do_wait+0x1d9/0x230
[<ffffffff8106a637>] SyS_wait4+0x67/0xe0
[<ffffffff81068430>] child_wait_callback+0x0/0x60
[<ffffffff81514a0d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff
~$ cat /proc/self/stack
[<ffffffff81021bae>] save_stack_trace_tsk+0x1e/0x40
[<ffffffff81208ebd>] proc_pid_stack+0x8d/0xe0
[<ffffffff81209ac7>] proc_single_show+0x47/0x80
[<ffffffff811ca132>] seq_read+0xe2/0x360
[<ffffffff811a8723>] vfs_read+0x93/0x170
[<ffffffff811a9352>] SyS_read+0x42/0xa0
[<ffffffff81516a28>] page_fault+0x28/0x30
[<ffffffff81514a0d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff
~$ sudo cat /proc/1/stack
[<ffffffff811e8730>] ep_send_events_proc+0x0/0x1b0
[<ffffffff811e9079>] ep_scan_ready_list.isra.7+0x199/0x1c0
[<ffffffff811e931a>] ep_poll+0x25a/0x340
[<ffffffff810970a0>] default_wake_function+0x0/0x10
[<ffffffff811ea7a4>] SyS_epoll_wait+0xb4/0xe0
[<ffffffff81514a0d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff
3.19.5 proc/<pid>/stat
Process status information, used by ps(1). The corresponding code is in kernel fs/proc/array.c; a representative excerpt:
/*
 * The task state array is a strange "bitmap" of
 * reasons to sleep. Thus "running" is zero, and
 * you can test for combinations of others with
 * simple bit tests.
 */
static const char * const task_state_array[] = {

    /* states in TASK_REPORT: */
    "R (running)",        /* 0x00 */
    "S (sleeping)",       /* 0x01 */
    "D (disk sleep)",     /* 0x02 */
    "T (stopped)",        /* 0x04 */
    "t (tracing stop)",   /* 0x08 */
    "X (dead)",           /* 0x10 */
    "Z (zombie)",         /* 0x20 */
    "P (parked)",         /* 0x40 */

    /* states beyond TASK_REPORT: */
    "I (idle)",           /* 0x80 */
};
The man page describes each field in detail:
~$ cat /proc/self/stat
1897 (bash) S 1892 1897 1897 34817 6762 4202496 76561 174332 1 115 113 36 134 64 20 0 1 0 3871 31305728 1893 18446744073709551615 4194304 5184116 140734690585584 140734690584264 140388838861628 0 65536 3670020 1266777851 0 0 0 17 2 0 0 2 0 0 7282144 7319112 11919360 140734690589286 140734690589292 140734690589292 140734690590702 0
/proc/[pid]/stat
Status information about the process. This is used by ps(1). It is defined in the kernel source file fs/proc/array.c.
The fields, in order, with their proper scanf(3) format specifiers, are:
(1) pid %d The process ID.
(2) comm %s The filename of the executable, in parentheses. This is visible whether or not the executable is swapped out.
(3) state %c One of the following characters, indicating process state:
R Running
S Sleeping in an interruptible wait
D Waiting in uninterruptible disk sleep
Z Zombie
T Stopped (on a signal) or (before Linux 2.6.33) trace stopped
t Tracing stop (Linux 2.6.33 onward)
W Paging (only before Linux 2.6.0)
X Dead (from Linux 2.6.0 onward)
x Dead (Linux 2.6.33 to 3.13 only)
K Wakekill (Linux 2.6.33 to 3.13 only)
W Waking (Linux 2.6.33 to 3.13 only)
P Parked (Linux 3.9 to 3.13 only)
(4) ppid %d The PID of the parent of this process.
(5) pgrp %d The process group ID of the process.
(6) session %d The session ID of the process.
(7) tty_nr %d The controlling terminal of the process. (The minor device number is contained in the combination of bits 31 to 20 and 7 to 0; the major device number is in bits 15 to 8.)
(8) tpgid %d The ID of the foreground process group of the controlling terminal of the process.
(9) flags %u The kernel flags word of the process. For bit meanings, see the PF_* defines in the Linux kernel source file include/linux/sched.h. Details depend on the kernel version.
The format for this field was %lu before Linux 2.6.
(10) minflt %lu The number of minor faults the process has made which have not required loading a memory page from disk.
(11) cminflt %lu The number of minor faults that the process's waited-for children have made.
(12) majflt %lu The number of major faults the process has made which have required loading a memory page from disk.
(13) cmajflt %lu The number of major faults that the process's waited-for children have made.
(14) utime %lu Amount of time that this process has been scheduled in user mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). This includes guest time, guest_time (time spent running a virtual CPU, see below), so that applications that are not aware of the guest time field do not lose that time from their calculations.
(15) stime %lu Amount of time that this process has been scheduled in kernel mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
(16) cutime %ld Amount of time that this process's waited-for children have been scheduled in user mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). (See also times(2).) This includes guest time, cguest_time (time spent running a virtual CPU, see below).
(17) cstime %ld Amount of time that this process's waited-for children have been scheduled in kernel mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
(18) priority %ld (Explanation for Linux 2.6) For processes running a real-time scheduling policy (policy below; see sched_setscheduler(2)), this is the negated scheduling priority, minus one; that is, a number in the range -2 to -100, corresponding to real-time priorities 1 to 99. For processes running under a non-real-time scheduling policy, this is the raw nice value (setpriority(2)) as represented in the kernel. The kernel stores nice values as numbers in the range 0 (high) to 39 (low), corresponding to the user-visible nice range of -20 to 19.
Before Linux 2.6, this was a scaled value based on the scheduler weighting given to this process.
(19) nice %ld The nice value (see setpriority(2)), a value in the range 19 (low priority) to -20 (high priority).
(20) num_threads %ld Number of threads in this process (since Linux 2.6). Before kernel 2.6, this field was hard coded to 0 as a placeholder for an earlier removed field.
(21) itrealvalue %ld The time in jiffies before the next SIGALRM is sent to the process due to an interval timer. Since kernel 2.6.17, this field is no longer maintained, and is hard coded as 0.
(22) starttime %llu The time the process started after system boot. In kernels before Linux 2.6, this value was expressed in jiffies. Since Linux 2.6, the value is expressed in clock ticks (divide by sysconf(_SC_CLK_TCK)).
The format for this field was %lu before Linux 2.6.
(23) vsize %lu Virtual memory size in bytes.
(24) rss %ld Resident Set Size: number of pages the process has in real memory. This is just the pages which count toward text, data, or stack space. This does not include pages which have not been demand-loaded in, or which are swapped out.
(25) rsslim %lu Current soft limit in bytes on the rss of the process; see the description of RLIMIT_RSS in getrlimit(2).
(26) startcode %lu The address above which program text can run.
(27) endcode %lu The address below which program text can run.
(28) startstack %lu The address of the start (i.e., bottom) of the stack.
(29) kstkesp %lu The current value of ESP (stack pointer), as found in the kernel stack page for the process.
(30) kstkeip %lu The current EIP (instruction pointer).
(31) signal %lu The bitmap of pending signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.
(32) blocked %lu The bitmap of blocked signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.
(33) sigignore %lu The bitmap of ignored signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.
(34) sigcatch %lu The bitmap of caught signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.
(35) wchan %lu This is the "channel" in which the process is waiting. It is the address of a location in the kernel where the process is sleeping. The corresponding symbolic name can be found in /proc/[pid]/wchan.
(36) nswap %lu Number of pages swapped (not maintained).
(37) cnswap %lu Cumulative nswap for child processes (not maintained).
(38) exit_signal %d (since Linux 2.1.22) Signal to be sent to parent when we die.
(39) processor %d (since Linux 2.2.8) CPU number last executed on.
(40) rt_priority %u (since Linux 2.5.19) Real-time scheduling priority, a number in the range 1 to 99 for processes scheduled under a real-time policy, or 0, for non-real-time processes (see sched_setscheduler(2)).
(41) policy %u (since Linux 2.5.19) Scheduling policy (see sched_setscheduler(2)). Decode using the SCHED_* constants in linux/sched.h.
The format for this field was %lu before Linux 2.6.22.
(42) delayacct_blkio_ticks %llu (since Linux 2.6.18) Aggregated block I/O delays, measured in clock ticks (centiseconds).
(43) guest_time %lu (since Linux 2.6.24) Guest time of the process (time spent running a virtual CPU for a guest operating system), measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
(44) cguest_time %ld (since Linux 2.6.24) Guest time of the process's children, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).
(45) start_data %lu (since Linux 3.3) Address above which program initialized and uninitialized (BSS) data are placed.
(46) end_data %lu (since Linux 3.3) Address below which program initialized and uninitialized (BSS) data are placed.
(47) start_brk %lu (since Linux 3.3) Address above which program heap can be expanded with brk(2).
(48) arg_start %lu (since Linux 3.5) Address above which program command-line arguments (argv) are placed.
(49) arg_end %lu (since Linux 3.5) Address below which program command-line arguments (argv) are placed.
(50) env_start %lu (since Linux 3.5) Address above which program environment is placed.
(51) env_end %lu (since Linux 3.5) Address below which program environment is placed.
(52) exit_code %d (since Linux 3.5) The thread's exit status in the form reported by waitpid(2).
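One practical wrinkle when parsing this file: comm (field 2) is an arbitrary executable name and may itself contain spaces and ')' characters, so a naive whitespace split is unsafe; the usual fix is to scan backwards from the last ')'. A sketch (the helper name `parse_stat` is hypothetical):

```c
#include <stdio.h>
#include <string.h>

/* Parse pid, comm and state from a /proc/[pid]/stat line.  comm may
 * contain spaces and ')' characters, so locate it between the first
 * '(' and the *last* ')'.  Illustrative sketch, not hardened. */
static int parse_stat(const char *line, int *pid, char *comm,
                      size_t comm_len, char *state)
{
    const char *lp = strchr(line, '(');
    const char *rp = strrchr(line, ')');
    size_t n;

    if (sscanf(line, "%d", pid) != 1 || !lp || !rp || rp < lp)
        return 0;
    n = (size_t)(rp - lp - 1);
    if (n >= comm_len)
        n = comm_len - 1;
    memcpy(comm, lp + 1, n);
    comm[n] = '\0';
    return sscanf(rp + 1, " %c", state) == 1;   /* state follows ')' */
}
```

On the sample line above this yields pid 1897, comm "bash", state 'S'; a pathological name like `(a) b` is still recovered correctly.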
3.19.6 proc/<pid>/statm
An estimate of the process's memory usage, measured in pages.
~$ cat statm
7643 1893 850 242 0 1040 0
/proc/[pid]/statm
Provides information about memory usage, measured in pages. The columns are:

size     (1) total program size (same as VmSize in /proc/[pid]/status)
resident (2) resident set size (same as VmRSS in /proc/[pid]/status)
share    (3) shared pages (i.e., backed by a file)
text     (4) text (code)
lib      (5) library (unused in Linux 2.6)
data     (6) data + stack
dt       (7) dirty pages (unused in Linux 2.6)
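Because statm counts pages, values must be multiplied by the page size to compare with the kB figures elsewhere in proc. On a 4096-byte-page system the sample's size of 7643 pages is 7643 × 4096 = 31305728 bytes, which matches the vsize field in the stat sample above. A conversion sketch:

```c
#include <unistd.h>

/* Convert a statm page count to kilobytes using the runtime page
 * size; statm itself reports bare page counts. */
static long pages_to_kb(long pages)
{
    return pages * (sysconf(_SC_PAGESIZE) / 1024);
}
```

Using sysconf(_SC_PAGESIZE) instead of a hard-coded 4096 keeps the conversion correct on architectures with larger pages.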
3.19.7 proc/<pid>/status
Provides much of the process information in human-readable form:
~$ cat /proc/self/status
Name:   cat
State:  R (running)
Tgid:   6955
Ngid:   0
Pid:    6955
PPid:   1897
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize: 256
Groups: 24 25 29 30 44 46 108 110 113 118 1000
VmPeak:    11012 kB
VmSize:    11012 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:       712 kB
VmRSS:       712 kB
VmData:      324 kB
VmStk:       136 kB
VmExe:        48 kB
VmLib:      1796 kB
VmPTE:        40 kB
VmSwap:        0 kB
Threads:        1
SigQ:   0/63000
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
Seccomp:        0
Cpus_allowed:   ff
Cpus_allowed_list:      0-7
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        0
nonvoluntary_ctxt_switches:     1
- Name: Command run by this process.
- State: Current state of the process. One of "R (running)", "S (sleeping)", "D (disk sleep)", "T (stopped)", "T (tracing stop)", "Z (zombie)", or "X (dead)".
- Tgid: Thread group ID (i.e., Process ID).
- Pid: Thread ID (see gettid(2)).
- PPid: PID of parent process.
- TracerPid: PID of process tracing this process (0 if not being traced).
- Uid, Gid: Real, effective, saved set, and filesystem UIDs (GIDs).
- FDSize: Number of file descriptor slots currently allocated.
- Groups: Supplementary group list.
- VmPeak: Peak virtual memory size.
- VmSize: Virtual memory size.
- VmLck: Locked memory size (see mlock(2)).
- VmHWM: Peak resident set size ("high water mark").
- VmRSS: Resident set size.
- VmData, VmStk, VmExe: Size of data, stack, and text segments.
- VmLib: Shared library code size.
- VmPTE: Page table entries size (since Linux 2.6.10).
- Threads: Number of threads in process containing this thread.
- SigQ: This field contains two slash-separated numbers that relate to queued signals for the real user ID of this process. The first of these is the number of currently queued signals for this real user ID, and the second is the resource limit on the number of queued signals for this process (see the description of RLIMIT_SIGPENDING in getrlimit(2)).
- SigPnd, ShdPnd: Number of signals pending for thread and for process as a whole (see pthreads(7) and signal(7)).
- SigBlk, SigIgn, SigCgt: Masks indicating signals being blocked, ignored, and caught (see signal(7)).
- CapInh, CapPrm, CapEff: Masks of capabilities enabled in inheritable, permitted, and effective sets (see capabilities(7)).
- CapBnd: Capability Bounding set (since Linux 2.6.26, see capabilities(7)).
- Cpus_allowed: Mask of CPUs on which this process may run (since Linux 2.6.24, see cpuset(7)).
- Cpus_allowed_list: Same as previous, but in "list format" (since Linux 2.6.26, see cpuset(7)).
- Mems_allowed: Mask of memory nodes allowed to this process (since Linux 2.6.24, see cpuset(7)).
- Mems_allowed_list: Same as previous, but in "list format" (since Linux 2.6.26, see cpuset(7)).
- voluntary_ctxt_switches, nonvoluntary_ctxt_switches: Number of voluntary and involuntary context switches (since Linux 2.6.23).
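Because status is line-oriented `Key: value` text, individual fields are easy to extract; the four Uid values, for example. A sketch (the helper name `parse_status_uids` is an assumption):

```c
#include <stdio.h>
#include <string.h>

/* Pull the real/effective/saved/filesystem UIDs out of
 * /proc/[pid]/status text, assuming the "Uid: R E S F" line
 * format shown above.  Illustrative sketch; returns 1 on success. */
static int parse_status_uids(const char *text, int uid[4])
{
    const char *p = strstr(text, "Uid:");

    if (!p)
        return 0;
    return sscanf(p + 4, "%d %d %d %d",
                  &uid[0], &uid[1], &uid[2], &uid[3]) == 4;
}
```

The same pattern works for any other line in the file (VmRSS, Threads, SigPnd, ...), with the value format adjusted per field.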
3.19.8 proc/<pid>/syscall
Information about the system call the process is currently executing. The first value is the system-call number, followed by the six argument registers, then the stack pointer and the program counter.
~$ cat /proc/self/syscall
0 0x3 0x7f065d1ad000 0x20000 0x7ffcb1120dd0 0xffffffff 0x0 0x7ffcb1120f70 0x7f065cce0ba0
3.20 proc/self
proc/self is a symbolic link that always points to the /proc/<pid> directory of the process reading it.
~$ ls -ld /proc/self
lrwxrwxrwx 1 root root 0 Dec 20 18:47 /proc/self -> 2289
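Since the link target is just the pid as a decimal string, readlink(2) lets a process discover its own pid through proc. A sketch (the function name is made up for illustration):

```c
#include <stdio.h>
#include <unistd.h>

/* Resolve /proc/self to the caller's own pid directory via
 * readlink(2); the link target is the pid as decimal text.
 * Returns the pid, or -1 on error. */
static int self_pid_from_proc(void)
{
    char buf[32];
    ssize_t n = readlink("/proc/self", buf, sizeof buf - 1);
    int pid;

    if (n <= 0)
        return -1;
    buf[n] = '\0';                  /* readlink does not terminate */
    if (sscanf(buf, "%d", &pid) != 1)
        return -1;
    return pid;
}
```

The result should always agree with getpid(2); the proc route is mainly useful in contexts where paths, not syscalls, are the natural interface (shell scripts, path-based APIs).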
3.21 oom_score
4 Kernel Implementation
5 References
- IBM developerWorks: procfs, seq_file, debugfs and relayfs
- https://www.ibm.com/developerworks/cn/linux/l-kerns-usrs2/
- LWN Driver porting: The seq_file interface
- https://lwn.net/Articles/22355/