Java HashMap Notes

Source Code

JDK 1.7 source code: jdk/jdk/src/share/classes/java/util/HashMap.java at jdk7-b147 · openjdk/jdk

JDK 1.8 source code: jdk/jdk/src/share/classes/java/util/HashMap.java at jdk8-b120 · openjdk/jdk

HashMap

Reading the JDK 8 HashMap Javadoc

Hash table based implementation of the Map interface. This implementation provides all of the optional map operations, and permits null values and the null key. (The HashMap class is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

In other words: HashMap is an implementation of the Map interface based on a hash table. It provides all of the optional map operations and permits null keys and values. (HashMap is roughly equivalent to Hashtable, except that HashMap is unsynchronized and permits nulls.) It does not guarantee that the iteration order matches the insertion order, nor that the order stays constant over time.

This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the “capacity” of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it’s very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.

In other words: the basic operations get and put run in constant time, assuming the hash function spreads the elements evenly across the buckets. Iterating over the collection views takes time proportional to the capacity of the HashMap (the number of buckets) plus its size (the number of key-value mappings), so if iteration performance matters, don't set the initial capacity too high (or the load factor too low).

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. Note that using many keys with the same hashCode() is a sure way to slow down performance of any hash table. To ameliorate impact, when keys are Comparable, this class may use comparison order among keys to help break ties.

Note that this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that an instance already contains is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be “wrapped” using the Collections.synchronizedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map:

```java
Map m = Collections.synchronizedMap(new HashMap(...));
```

The iterators returned by all of this class’s “collection view methods” are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator’s own remove method, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.

Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.
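
To make the fail-fast behavior concrete, here is a minimal sketch (the class name FailFastDemo is made up for illustration). Adding a new key during a for-each loop is a structural modification, so the underlying iterator throws on its next step:

```java
import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        // The for-each loop iterates via the keySet's iterator. put("c", 3)
        // is a structural modification (a new key), so the iterator's next
        // call to next() sees a changed modCount and throws.
        for (String key : map.keySet()) {
            map.put("c", 3); // -> ConcurrentModificationException
        }
    }
}
```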

Implementation notes

This map usually acts as a binned (bucketed) hash table, but when bins get too large, they are transformed into bins of TreeNodes, each structured similarly to those in java.util.TreeMap. Most methods try to use normal bins, but relay to TreeNode methods when applicable (simply by checking instanceof a node). Bins of TreeNodes may be traversed and used like any others, but additionally support faster lookup when overpopulated. However, since the vast majority of bins in normal use are not overpopulated, checking for existence of tree bins may be delayed in the course of table methods.

Tree bins (i.e., bins whose elements are all TreeNodes) are ordered primarily by hashCode, but in the case of ties, if two elements are of the same “class C implements Comparable<C>” type, then their compareTo method is used for ordering. (We conservatively check generic types via reflection to validate this — see method comparableClassFor). The added complexity of tree bins is worthwhile in providing worst-case O(log n) operations when keys either have distinct hashes or are orderable. Thus, performance degrades gracefully under accidental or malicious usages in which hashCode() methods return values that are poorly distributed, as well as those in which many keys share a hashCode, so long as they are also Comparable. (If neither of these apply, we may waste about a factor of two in time and space compared to taking no precautions. But the only known cases stem from poor user programming practices that are already so slow that this makes little difference.)

Because TreeNodes are about twice the size of regular nodes, we use them only when bins contain enough nodes to warrant use (see TREEIFY_THRESHOLD). And when they become too small (due to removal or resizing) they are converted back to plain bins. In usages with well-distributed user hashCodes, tree bins are rarely used. Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:

0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million

The root of a tree bin is normally its first node. However, sometimes (currently only upon Iterator.remove), the root might be elsewhere, but can be recovered following parent links (method TreeNode.root()).

All applicable internal methods accept a hash code as an argument (as normally supplied from a public method), allowing them to call each other without recomputing user hashCodes. Most internal methods also accept a “tab” argument, that is normally the current table, but may be a new or old one when resizing or converting.

When bin lists are treeified, split, or untreeified, we keep them in the same relative access/traversal order (i.e., field Node.next) to better preserve locality, and to slightly simplify handling of splits and traversals that invoke iterator.remove. When using comparators on insertion, to keep a total ordering (or as close as is required here) across rebalancings, we compare classes and identityHashCodes as tie-breakers.

The use and transitions among plain vs tree modes is complicated by the existence of subclass LinkedHashMap. See below for hook methods defined to be invoked upon insertion, removal and access that allow LinkedHashMap internals to otherwise remain independent of these mechanics. (This also requires that a map instance be passed to some utility methods that may create new nodes.)

The concurrent-programming-like SSA-based coding style helps avoid aliasing errors amid all of the twisty pointer operations.

Analysis of the put() method

```java
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}
```

```java
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```

The hash() method

When the table index is computed, only the low n bits of the hash matter, because the index is (table.length - 1) & hash. If several keys' hashCodes share the same low n bits but differ in the upper 32-n bits, they arguably deserve different buckets, yet without any extra processing they would all land in the same one. Too many elements in a single bucket makes lookups slow.

Mixing the high 16 bits into the low 16 bits with one operation (h ^ (h >>> 16)) "perturbs" the hash, so that differences in the high bits also influence the bucket index.
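
A small sketch of the effect (HashSpreadDemo and spread are illustrative names; spread just copies the XOR step from JDK 8's hash()). Two hash codes that differ only in their high 16 bits collide without the perturbation, but separate with it:

```java
public class HashSpreadDemo {
    // The same spreading step used by JDK 8's HashMap.hash()
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16;           // table length (a power of two)
        int h1 = 0x0000_0001; // differs from h2 only in the high 16 bits
        int h2 = 0x7FFF_0001;

        // Without spreading, both collapse onto bucket 1:
        System.out.println((n - 1) & h1); // 1
        System.out.println((n - 1) & h2); // 1

        // With spreading, the high bits now influence the index:
        System.out.println((n - 1) & spread(h1)); // 1
        System.out.println((n - 1) & spread(h2)); // 14
    }
}
```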

References:

JDK 源码中 HashMap 的 hash 方法原理是什么? - 知乎

全网把Map中的hash()分析的最透彻的文章,别无二家 - HollisChuang’s Blog

Analysis from Different Perspectives

Resizing

For resizing, see the resize() method; by default, HashMap doubles the number of internal buckets when it grows. From the Javadoc of resize():

Initializes or doubles table size. If null, allocates in accord with initial capacity target held in field threshold. Otherwise, because we are using power-of-two expansion, the elements from each bin must either stay at same index, or move with a power of two offset in the new table.

Both initialization and growth go through the same resize() method.
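
The "stay at the same index, or move by a power-of-two offset" rule from the Javadoc can be sketched as follows (ResizeSplit and newIndex are hypothetical names, not JDK code):

```java
final class ResizeSplit {
    // Where does an entry land after the table doubles from oldCap to
    // 2 * oldCap (oldCap a power of two)? A sketch, not the JDK source.
    static int newIndex(int hash, int oldCap) {
        int oldIndex = hash & (oldCap - 1);
        // Doubling exposes exactly one new hash bit. If that bit is 0 the
        // entry keeps its old index; if it is 1 it moves up by oldCap.
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        System.out.println(newIndex(0x21, oldCap)); // 0x21 & 16 == 0  -> stays at 1
        System.out.println(newIndex(0x31, oldCap)); // 0x31 & 16 == 16 -> moves to 17
    }
}
```

This is why JDK 8 does not recompute hashes during a resize: a single bit of the cached hash determines each entry's new position.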

Common Pitfalls

No put while iterating

If you call put (or make any other structural modification) while traversing the map with an iterator, the iterator throws a ConcurrentModificationException. A safe pattern is sketched after the links below.

iterator - Java HashMap add new entry while iterating - Stack Overflow

HashMap中ConcurrentModificationException异常解读 - yehx - 博客园
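
A minimal sketch of the supported pattern (the class name IterateAndModify is made up): modify through the iterator itself, which is the one structural modification the fail-fast machinery allows:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class IterateAndModify {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        // Removing through the iterator is safe; removing or putting
        // through the map itself mid-iteration is not.
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() == 1) {
                it.remove();
            }
        }
        System.out.println(map); // {b=2}
    }
}
```

To add entries while traversing, collect them first and put them after the loop, or use a ConcurrentHashMap.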

Concurrency Issues

HashMap is not thread-safe, but people do use it in multithreaded scenarios anyway. Knowing the common failure modes makes it much faster to debug such a problem when we run into one.

探究HashMap线性不安全(二)——链表成环的详细过程 - 比脚更长的路 - 博客园

谈谈HashMap线程不安全的体现 - God is a Coder.. - OSCHINA

浅谈HashMap与线程安全 (JDK1.8) - Captain_Eason - 博客园

HashMap为何从头插入改为尾插入 - 掘金

Other Notes

The parameter type of get

HashMap's put() method is declared as put(K key, V value), while get() is declared as get(Object key). Why doesn't get() use the generic type? Presumably because HashMap only requires the object passed to a lookup to be logically equal to a stored key; it does not have to be of the same declared type.

Personally I see this as a choice by the JDK designers; in C#, by contrast, the dictionary type's lookup methods do use the generic key parameter.
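
A sketch of what the untyped signature enables (GetWithObjectDemo is an illustrative name): the probe object only has to be equals()-equal to a stored key, not of the same class:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class GetWithObjectDemo {
    public static void main(String[] args) {
        Map<ArrayList<String>, String> map = new HashMap<>();
        map.put(new ArrayList<>(Arrays.asList("x")), "value");

        // The probe is a LinkedList, not an ArrayList, so a hypothetical
        // generic get(K) would reject it at compile time. get(Object)
        // accepts it, and List.equals()/hashCode() compare by contents:
        List<String> probe = new LinkedList<>(Arrays.asList("x"));
        System.out.println(map.get(probe)); // value
    }
}
```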

Some articles that discuss this question:

java - What are the reasons why Map.get(Object key) is not (fully) generic - Stack Overflow

smallwig: Why does Set.contains() take an Object, not an E?

Differences from Hashtable

  1. Hashtable is thread-safe; HashMap is not.
  2. HashMap permits null keys and values; Hashtable permits neither (see the sketch below).
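
A quick sketch of point 2 (NullKeyDemo is an illustrative name): HashMap maps the null key to bucket 0, while Hashtable rejects null keys and values outright:

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> hashMap = new HashMap<>();
        hashMap.put(null, "ok");               // the null key hashes to 0
        System.out.println(hashMap.get(null)); // ok

        Map<String, String> hashtable = new Hashtable<>();
        hashtable.put(null, "boom");           // throws NullPointerException
    }
}
```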

Why the number of buckets is always a power of two

I think this is mainly for performance. Basic operations such as get and put must turn a hashCode into a bucket position, which means reducing it modulo the array length. If the array length n is a power of two, the modulo x % n can be replaced by x & (n - 1), and the bitwise AND is faster than a modulo operation (and, unlike %, never yields a negative index). A small demonstration follows.
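
A small demonstration of the equivalence (PowerOfTwoIndexDemo is an illustrative name):

```java
public class PowerOfTwoIndexDemo {
    public static void main(String[] args) {
        int n = 16; // table length, a power of two
        for (int hash : new int[] {5, 21, 0x7FFF0001}) {
            // For non-negative hashes the two expressions agree:
            System.out.println((hash % n) + " == " + (hash & (n - 1)));
        }
        // Unlike %, the mask never yields a negative index:
        System.out.println(-3 % n);       // -3
        System.out.println(-3 & (n - 1)); // 13
    }
}
```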

Improvements from JDK 7 to JDK 8

  1. In JDK 1.7 the storage structure is array + linked lists. In JDK 1.8 it is normally also array + linked lists, but when the internal array length reaches 64 and a bucket's chain grows beyond 8 nodes, that chain is converted to a red-black tree. After treeification, when removals shrink a tree bucket to 6 nodes or fewer, it is converted back to a linked list. Why 6? To avoid flip-flopping between list and tree, which would hurt efficiency. The advantage of the tree: when many nodes share a hash, a linked list costs O(n) per operation while a red-black tree costs O(log n), which performs much better. (The thresholds are restated in the snippet after this list.)

  2. In 1.7, new elements are inserted at the head of a bucket's list; in 1.8 they are appended at the tail.

    In 1.7, if a HashMap is used from multiple threads and several of them trigger resize at the same time, a bucket's linked list can form a cycle, which can then drive CPU usage to 100%. Knowing this helps when troubleshooting production issues, but the real cure is to avoid HashMap in multithreaded code from the start and use ConcurrentHashMap instead.
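
For reference (see item 1 in the list above), the thresholds involved are actual constants in JDK 8's java.util.HashMap, restated here in a hypothetical holder class:

```java
// Values as declared in JDK 8's java.util.HashMap:
final class HashMapThresholds {
    static final int TREEIFY_THRESHOLD = 8;     // chain -> tree above this bin size
    static final int UNTREEIFY_THRESHOLD = 6;   // tree -> chain at or below this size
    static final int MIN_TREEIFY_CAPACITY = 64; // smaller tables resize instead of treeifying
}
```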

If you already have a HashMap and need thread-safe access to it, what can you do?

You can wrap the existing HashMap with Collections.synchronizedMap(), which returns a new Map that is thread-safe. The wrapper is of type java.util.Collections.SynchronizedMap, and every one of its methods delegates to the original map while holding a lock (synchronized on an internal mutex), which is what guarantees thread safety.
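
A usage sketch (the class name SynchronizedWrapDemo is made up). Note that, per the Collections.synchronizedMap Javadoc, iterating over the wrapped map still requires manual synchronization:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SynchronizedWrapDemo {
    public static void main(String[] args) {
        Map<String, Integer> unsafe = new HashMap<>();
        Map<String, Integer> safe = Collections.synchronizedMap(unsafe);

        safe.put("a", 1); // each call locks the wrapper's internal mutex

        // Iteration still needs manual synchronization, per the Javadoc:
        synchronized (safe) {
            for (Map.Entry<String, Integer> e : safe.entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }
}
```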

How is the conversion from a red-black tree back to a linked list handled?

(A sketch of an answer, from reading the JDK 8 source: untreeification happens in two places. TreeNode.split(), called during a resize, converts a split-off bin back when it holds UNTREEIFY_THRESHOLD (6) nodes or fewer, and TreeNode.removeTreeNode() converts back when the tree has become too small; both rebuild a plain Node chain via untreeify(), preserving traversal order.)

Interview Questions

  1. What does the structure of a HashMap look like?

    HashMap has two levels: the first level is an array; the second level is, by default, a linked list per bucket. When the internal array length reaches 64 and a bucket's chain grows beyond 8 nodes, the chain is converted to a red-black tree. See the put() and treeifyBin() methods for details.

  2. What is the flow of putting an element?

    When putting an element, a hash is first derived from the key's hashCode (see hash() above). That hash is reduced modulo the internal array length to choose a bucket, and the entry is placed there. If the bucket then holds 8 or more nodes (and the table is large enough), it is converted to a red-black tree. After the insertion, if the element count exceeds the threshold, resize() is called to grow the table. A simplified sketch of this flow appears after this list.

  3. Why convert a bucket to a red-black tree only when it exceeds 8 elements?

    A tree node takes roughly twice the memory of a list node, so the hope is that trees are rarely needed. The Javadoc notes that, when users' key hashes are reasonably designed, bucket sizes follow a Poisson distribution, and the probability of a bucket growing past 8 nodes is vanishingly small (less than 1 in 10 million).

  4. What is the time complexity of get?

    We usually say the average time complexity of get is O(1): no matter how many elements the Map holds, we can essentially always retrieve a value in constant time. The worst case is O(log n): if the keys' hashCode is overridden to return one fixed value, every element piles into a single bucket; in JDK 8 that bucket becomes a red-black tree, so getting an element costs O(log n).

  5. When does a HashMap grow, and how exactly is the resize executed?

    HashMap has a loadFactor parameter, 0.75 by default: when the number of stored key-value pairs exceeds internal array length * loadFactor, a resize happens. The resize first doubles the internal array length and then redistributes the elements: each stored element's position in the new array is recomputed from its hash, and the data is written into the new array. (In JDK 8 the cached hash is not recomputed; the power-of-two rule sketched earlier decides whether each entry stays put or moves by the old capacity.)

    As for why the default loadFactor is 0.75: the JDK developers concluded from some mathematical analysis that 0.75 strikes a good balance between time and space costs; the HashMap Javadoc explains this in detail.

    After a resize, entries that previously shared one chain may be scattered across different buckets, which spreads the data out, avoids the risk of overly long chains, and improves performance.

  6. Explain the differences between HashMap in JDK 1.7 and 1.8, and the exact resize flow. What does the newer version optimize in resizing? If a put happens while a resize is in progress, does the data land in the new array or the old one? What problems can 1.7 produce under multithreading, and why?
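
As referenced in question 2 above, here is a heavily simplified toy implementation of the put flow (ToyHashMap is a study sketch under the assumptions noted in its comments, not JDK code):

```java
import java.util.Objects;

// A toy chained hash map that mirrors the put flow described above.
// Simplifications vs. the JDK: no treeification, no fail-fast iterators,
// and resize re-inserts at the head, so chain order is not preserved.
public class ToyHashMap<K, V> {
    private static final class Node<K, V> {
        final int hash; final K key; V value; Node<K, V> next;
        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private Node<K, V>[] table = new Node[16];
    private int size;
    private static final float LOAD_FACTOR = 0.75f;

    private static int hash(Object key) { // same spreading step as JDK 8
        int h;
        return key == null ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public V put(K key, V value) {
        int h = hash(key);
        int i = (table.length - 1) & h;            // pick the bucket
        for (Node<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == h && Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value;                   // same key: overwrite
                return old;
            }
        }
        table[i] = new Node<>(h, key, value, table[i]); // prepend new node
        if (++size > table.length * LOAD_FACTOR) {
            resize();                              // grew past the threshold
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    private void resize() {                        // double and redistribute
        Node<K, V>[] oldTab = table;
        table = new Node[oldTab.length * 2];
        for (Node<K, V> e : oldTab) {
            while (e != null) {
                Node<K, V> next = e.next;
                int i = (table.length - 1) & e.hash; // stay, or move by oldCap
                e.next = table[i];
                table[i] = e;
                e = next;
            }
        }
    }
}
```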

References

  1. HashMap Javadoc (Java Platform SE 8)
  2. Hashtable Javadoc (Java Platform SE 8)
  3. Java HashMap工作原理及实现 - Yikun的博客
  4. HashMap完全解读 - Hollis [mostly based on JDK 1.7]
  5. 全网把Map中的hash()分析的最透彻的文章,别无二家 - Hollis的博客 [genuinely well written!]
  6. 天下无难试之HashMap面试刁难大全 [by Zhihu author 老钱; thinks for himself rather than parroting]
  7. HashMap? ConcurrentHashMap? 相信看完这篇没人能难住你! - crossoverJie的博客 [JDK 1.7, JDK 1.8]
  8. JDK7与JDK8中HashMap的实现 - 开源中国
  9. 红黑树 - 开源中国
  10. HashMap的loadFactor为什么是0.75?[deep analysis, including the underlying math]
  11. HashMap初始容量与负载因子设置如何影响HashMap性能
  12. Java 8系列之重新认识HashMap - 知乎 | Meituan tech blog version: Java 8系列之重新认识HashMap - 美团技术团队
  13. 【不做标题党,只做纯干货】HashMap在Jdk1.7和1.8中的实现
  14. 老生常谈,HashMap的死循环 - 占小狼的简书
  15. HashMap的实现与优化
  16. HashMap实现原理及源码分析 - dreamcatcher-cx - 博客园 [JDK 1.7]
  17. 阿里面试题:为什么Map桶中个数超过8才转为红黑树
  18. 京东二面:为什么HashMap底层树化标准的元素个数是8? - 简书
  19. hashmap中为何链表长度大于8才转换成红黑树 - 知乎
  20. 为什么 HashMap 中链表长度大于 8 才转化为红黑树? - 知乎
  21. 链表转红黑树的原因?为什么阈值为8? - JustJavaIt - 博客园 [well written]
  22. HashMap为何从头插入改为尾插入 - 掘金
  23. hashmap头插法和尾插法区别 | 一个跟面试官扯皮半个小时的HashMap - 牧云君的博客 - CSDN博客
  24. (1)美团面试题:Hashmap的结构,1.7和1.8有哪些区别,史上最深入的分析_王伟的博客-CSDN博客_hashmap1.7和1.8的区别
  25. HashMap相关原理(1.7和1.8的区别,红黑树、扩容机制) - 墨天轮
  26. HashMap 1.7和1.8的区别,答到面试官怀疑人生 - 程序员小章的博客 - CSDN博客
  27. 面试官:讲讲HashMap实现原理,以及JDK1.7和1.8区别? | 哈利路牙儿
  28. HashMap系列:树化阀值8,退化阀值6 - 掘金
  29. HashMap 链表插入方式 → 头插为何改成尾插 ? - 青石路 - 博客园
  30. 干货:HashMap的工作原理解析 - 掘金 [mostly about JDK 1.7]

Interview Question Collections

  1. 有关 HashMap 面试会问的一切 [2021-11-10: went through everything except the algorithm problems]
  2. HashMap面试题:90%的人回答不上来 - 简书 [2021-11-10: read through once]
  3. HashMap 相关面试题及其解答 - 简书
  4. 【java集合】HashMap常见面试题 - CSDN博客
  5. HashMap的实现原理+阿里HasMap面试题_lizhen1114的博客-CSDN博客_hashmap面试题