Understanding and Implementing Hash Tables

大_圣


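1. Understanding

Assign a key to every object you want to store, and use a hash function to map that key to the address of a storage slot. To look the object up later, you only need its key: run it through the hash function again and go straight to the slot at that address to find the data.

But this raises a question: is the address produced by the hash function guaranteed to be different for every distinct key? It is not. There are usually far more possible keys than storage slots, so two different keys can hash to the same address (as 胡*总 likes to say, whatever the mathematicians tell you, you had better believe it).

That brings up the next problem: resolving hash collisions.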
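To make the collision concrete, here is a small sketch. The class name CollisionDemo and the helper slotFor are made up for illustration; the only real fact it relies on is that the Java strings "Aa" and "BB" have the same hashCode (2112).

```java
// Two different keys with the same hash code: a minimal collision demo.
class CollisionDemo {
    // Hypothetical helper: maps a key to a slot index in a table of the given length.
    static int slotFor(String key, int tableLength) {
        return Math.floorMod(key.hashCode(), tableLength);
    }

    public static void main(String[] args) {
        String a = "Aa";
        String b = "BB";
        System.out.println(a.hashCode() == b.hashCode());     // true: both are 2112
        System.out.println(a.equals(b));                       // false: the keys differ
        System.out.println(slotFor(a, 11) == slotFor(b, 11)); // true: same slot, a collision
    }
}
```

Whatever the table length, these two keys always land in the same slot, so the table needs some rule for storing both.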

Ways to resolve hash collisions:

1) Separate chaining;

2) Open addressing;

3) Double hashing;

......

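ps: (1) Separate chaining: if the keys of different objects hash to the same address, those two (or more) objects are stored in one linear linked list.

In the figure, {dt1, dt8}, {dt4, dt7}, {dt3, dt6, dt10} and {dt2, dt9} each hash to the same address, so each group is linked into one chain. As you can see, every element of the table's array is really the head of a linked list.

(2) Open addressing: if the hashed address is already occupied, search the slots after that address and store the data in the first free slot you find (linear probing; quadratic probing).

(3) Double hashing: provide two hash functions. When the address from the first one collides with another entry's address, pass the value through the second hash function to get another address, which reduces the chance of a collision (this extends to more than two hash functions).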
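The implementations later in this post use separate chaining, so here is a rough sketch of the open addressing idea with linear probing for comparison. The class LinearProbingTable is hypothetical and simplified on purpose: String keys only, a fixed capacity, no resizing and no removal.

```java
// Open addressing with linear probing: a fixed-size sketch (no resize, no remove).
class LinearProbingTable {
    private final String[] keys;
    private final String[] values;
    private int size;

    LinearProbingTable(int capacity) {
        keys = new String[capacity];
        values = new String[capacity];
    }

    // Home slot of a key: hash code reduced to the range [0, capacity).
    private int home(String key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    // Returns false when the table is full and the key is not already present.
    boolean put(String key, String value) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) {          // free slot: store here
                keys[i] = key;
                values[i] = value;
                size++;
                return true;
            }
            if (keys[i].equals(key)) {      // same key: overwrite the value
                values[i] = value;
                return true;
            }
            i = (i + 1) % keys.length;      // occupied by another key: try the next slot
        }
        return false;
    }

    String get(String key) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) return null;          // empty slot reached: key is absent
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % keys.length;
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

Quadratic probing would only change the step (i + 1) into a growing offset; the rest of the structure stays the same.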

It is easy to see that a hash table's storage performance is closely tied to its hash function.
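As a rough illustration of that link, the sketch below counts how many buckets actually get used under two hash functions. Everything here is an assumption made for the demo: the class BucketSpread, the keys "0" to "999", the 64 buckets, and the deliberately poor hash (the string's length).

```java
import java.util.function.ToIntFunction;

// Compares how well two hash functions spread the keys "0".."n-1" over a bucket array.
class BucketSpread {
    // Counts how many of the `buckets` slots receive at least one key.
    static int usedBuckets(ToIntFunction<String> hashFn, int n, int buckets) {
        boolean[] used = new boolean[buckets];
        int count = 0;
        for (int i = 0; i < n; i++) {
            int index = Math.floorMod(hashFn.applyAsInt(String.valueOf(i)), buckets);
            if (!used[index]) {
                used[index] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Poor hash: only the lengths 1, 2 and 3 occur, so only 3 buckets are ever used.
        System.out.println(usedBuckets(String::length, 1000, 64));
        // String.hashCode spreads the same keys over many more buckets.
        System.out.println(usedBuckets(String::hashCode, 1000, 64));
    }
}
```

With the poor hash, a thousand keys pile up in three chains, so every lookup degenerates into a long list walk.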

A hash function can itself be constructed in several ways:

1) Direct addressing;

2) Division (remainder) method;

3) Digit analysis;

........

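ps: (1) Direct addressing: use each item's key itself, or the key plus some constant, as the address (personally, if a hash table is built this way, I would rather just use a plain array explicitly).

(2) Division method: divide each item's key by some number no larger than the hash table's length, and use the remainder as the item's hash address.

(3) Digit analysis: ... this one really depends on the specific problem.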
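The first two constructions fit in a few lines. This is only a sketch: the class HashFunctions and the method names are invented here, and the keys are assumed to be plain ints.

```java
// Sketch of two hash function constructions for integer keys.
class HashFunctions {
    // Direct addressing: address = key + constant.
    // Only sensible when the keys are small, dense integers.
    static int directAddress(int key, int offset) {
        return key + offset;
    }

    // Division method: address = key mod p, with p no larger than the table length.
    // Math.floorMod keeps the result in [0, p) even when the key is negative.
    static int divisionHash(int key, int p) {
        return Math.floorMod(key, p);
    }

    public static void main(String[] args) {
        System.out.println(directAddress(1990, -1990)); // 0
        System.out.println(divisionHash(25, 11));       // 3
        System.out.println(divisionHash(-25, 11));      // 8
    }
}
```

The negative case is the reason for floorMod: Java's % operator would give -3 for -25 % 11, which is not a valid array index.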

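2. Implementation (1):

First, the hash function. I started with the remainder method; once the number of stored entries exceeds 0.75 of the Map's capacity, the table grows, doubling in size each time. Enough talk, here is the code: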

public class HashTest<K, V> {

	// Default initial capacity
	private static final int DEFAULT_CAPACITY = 11;
	// Default load factor: grow once the entries exceed 0.75 of the array length
	private static final float DEFAULT_LOAD_FACTOR = 0.75f;

	// Number of entries stored in the Map (per instance, so not static)
	private int size;
	// Bucket array; each element is the head of a chain
	private Node<K, V>[] nodeArray;
	// Current length of the bucket array
	private int capacity;
	// Load factor of this Map
	private final float loadFactor;

	/**
	 * @param capacity
	 *            initial capacity of the Map
	 * @param loadFactor
	 *            load factor
	 */
	@SuppressWarnings("unchecked")
	HashTest(int capacity, float loadFactor) {
		// A capacity of 0 would make the modulo in put() divide by zero
		if (capacity <= 0)
			throw new IllegalArgumentException("Illegal initial capacity: "
					+ capacity);
		if (loadFactor <= 0 || Float.isNaN(loadFactor))
			throw new IllegalArgumentException("Illegal load factor: "
					+ loadFactor);
		this.capacity = capacity;
		this.loadFactor = loadFactor;
		size = 0;
		nodeArray = new Node[capacity]; // create the bucket array with this length
	}

	/**
	 * No-argument constructor
	 */
	HashTest() {
		this(DEFAULT_CAPACITY, DEFAULT_LOAD_FACTOR);
	}

	// The hash code modulo the capacity is used as the array index
	public void put(K key, V value) {
		int hash = key.hashCode();
		int index = Math.abs(hash % capacity);
		// If the key is already in the chain, overwrite its value instead of adding a duplicate
		for (Node<K, V> n = nodeArray[index]; n != null; n = n.next) {
			if (key.equals(n.k)) {
				n.v = value;
				return;
			}
		}
		Node<K, V> node = new Node<K, V>(key, value);
		node.hash = hash;
		// The new node becomes the head of the chain; whatever was there (possibly null) follows it
		node.next = nodeArray[index];
		nodeArray[index] = node;
		size++;
		// Grow once the load factor is exceeded
		if ((float) size / (float) capacity > loadFactor) {
			capacity = 2 * capacity; // update the Map's capacity
			extend(capacity);
		}
	}

	// Looks up the value mapped to the given key
	public V get(K key) {
		int index = Math.abs(key.hashCode() % capacity);
		V result = null;
		for (Node<K, V> node = nodeArray[index]; node != null; node = node.next) {
			if (key.equals(node.k)) {
				result = node.v;
				break;
			}
		}
		return result;
	}

	// Grows the bucket array to newCapacity and re-inserts every node
	@SuppressWarnings("unchecked")
	private void extend(int newCapacity) {
		Node<K, V>[] newArray = new Node[newCapacity];
		for (int i = 0; i < nodeArray.length; i++) {
			Node<K, V> n = nodeArray[i];
			while (n != null) {
				Node<K, V> next = n.next;
				int index = Math.abs(n.hash % newCapacity);
				// Put the node at the head of its new chain
				n.next = newArray[index];
				newArray[index] = n;
				n = next;
			}
		}
		nodeArray = newArray; // switch to the new array
	}

	public int size() {
		return size;
	}

	// Holds one key-value pair
	static class Node<K, V> {
		K k;
		V v;
		int hash; // hash code of this node's key, kept for rehashing
		Node<K, V> next; // next node in the same bucket

		Node(K k, V v) {
			this.k = k;
			this.v = v;
		}

	}
}
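To see when the growth rule in put actually fires, the sketch below replays just the size and capacity arithmetic, starting from capacity 11 and load factor 0.75. The class ResizeSchedule is only an illustration and stores nothing; it assumes every inserted key is distinct.

```java
import java.util.ArrayList;
import java.util.List;

// Replays the "size / capacity > loadFactor, then double" rule without storing anything.
class ResizeSchedule {
    // Returns the entry counts at which the table would grow while inserting `count` distinct keys.
    static List<Integer> resizePoints(int capacity, float loadFactor, int count) {
        List<Integer> points = new ArrayList<>();
        for (int size = 1; size <= count; size++) {
            if ((float) size / (float) capacity > loadFactor) {
                capacity *= 2;
                points.add(size);
            }
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(resizePoints(11, 0.75f, 100)); // [9, 17, 34, 67]
    }
}
```

So the capacity goes 11, 22, 44, 88, 176 over the first hundred insertions. Because it doubles each time, a million insertions trigger only about seventeen rehashes.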

Since the hash function is fairly simple, the storage performance is acceptable. Here is the test method:

public static void main(String args[]) {
		HashTest<String, String> map = new HashTest<String, String>();
		long time = System.currentTimeMillis();
		for (int i = 0; i <= 1000000; i++) {
			map.put("" + i, "" + i);
		}
		long time1 = System.currentTimeMillis();
		System.out.println("Store time: " + (time1 - time));
		long time2 = System.currentTimeMillis();
		String s = map.get("1000000");
		long time3 = System.currentTimeMillis();
		System.out.println("Lookup time: " + (time3 - time2) + " Value found: " + s);
	}

Storing one million entries, the final output was (times in milliseconds):

Store time: 2242

Lookup time: 0 Value found: 1000000

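Implementation (2):

I had a look at the source code of Java's HashMap and got interested in the hash and indexFor functions below (all that bit twiddling, how could I not be... well, actually, the real reason I thought of borrowing the JDK's approach is that I had run into some small problems with the remainder method that made performance about as bad as it could get, so I wrote this implementation):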

static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

static int indexFor(int h, int length) {
        return h & (length-1);
    }
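One detail about indexFor is worth spelling out: the mask h & (length - 1) equals h mod length only when length is a power of two. The sketch below checks which indices are reachable for a given length; the class IndexMaskDemo and the helper reachable are made up for this demo.

```java
// Shows why the bit mask in indexFor needs a power-of-two table length.
class IndexMaskDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Counts the distinct indices indexFor produces for the hashes 0..9999.
    static int reachable(int length) {
        boolean[] seen = new boolean[length];
        int count = 0;
        for (int h = 0; h < 10000; h++) {
            int i = indexFor(h, length);
            if (!seen[i]) {
                seen[i] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(reachable(16)); // 16: every bucket can be hit
        System.out.println(reachable(11)); // 4: the mask 10 (binary 1010) only yields 0, 2, 8, 10
        System.out.println(indexFor(-7, 16) == Math.floorMod(-7, 16)); // true, both are 9
    }
}
```

With a length such as 11, most buckets can never be used and the chains grow longer than they need to, so a table that relies on this mask should keep its length at a power of two.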

public class HashTest<K, V> {

	// Default initial capacity. It must be a power of two: hash & (length - 1)
	// can only reach every bucket when length is a power of two.
	private static final int DEFAULT_CAPACITY = 16;
	// Default load factor: grow once the entries exceed 0.75 of the array length
	private static final float DEFAULT_LOAD_FACTOR = 0.75f;

	private int size; // number of entries
	private int capacity; // current length of the bucket array (always a power of two)
	private final float loadFactor;
	private Node<K, V>[] nodeArray; // bucket array

	// Constructor taking the capacity and the load factor
	@SuppressWarnings("unchecked")
	HashTest(int capacity, float loadFactor) {
		if (capacity <= 0 || capacity > (1 << 30))
			throw new IllegalArgumentException("Illegal initial capacity: "
					+ capacity);
		if (loadFactor <= 0 || Float.isNaN(loadFactor))
			throw new IllegalArgumentException("Illegal load factor: "
					+ loadFactor);
		// Round the requested capacity up to the next power of two
		int c = 1;
		while (c < capacity)
			c <<= 1;
		this.capacity = c;
		this.loadFactor = loadFactor;
		size = 0;
		nodeArray = new Node[this.capacity];
	}

	// No-argument constructor
	HashTest() {
		this(DEFAULT_CAPACITY, DEFAULT_LOAD_FACTOR);
	}

	// Spreads the hash code (borrowed from the JDK)
	private int hash(int h) {
		h ^= (h >>> 20) ^ (h >>> 12);
		return h ^ (h >>> 7) ^ (h >>> 4);
	}

	// Maps a hash code to a bucket index (borrowed from the JDK)
	private int findIndex(int hash, int length) {
		return hash & (length - 1);
	}

	// Stores a key-value pair
	public void put(K key, V value) {
		int hash = hash(key.hashCode());
		int index = findIndex(hash, capacity); // bucket to use
		// If the key is already in the chain, overwrite its value instead of adding a duplicate
		for (Node<K, V> n = nodeArray[index]; n != null; n = n.next) {
			if (key.equals(n.k)) {
				n.v = value;
				return;
			}
		}
		Node<K, V> node = new Node<K, V>();
		node.k = key;
		node.v = value;
		// Keep the hash in the node so it can be reused when the table grows
		node.hashcode = hash;
		node.next = nodeArray[index]; // the old head of the chain follows the new node
		nodeArray[index] = node; // the new node becomes the head
		size++;
		if ((float) size / (float) capacity > loadFactor) {// grow once the load factor is exceeded
			capacity *= 2;
			extend(capacity);
		}
	}

	// Grows the bucket array and re-inserts every node
	@SuppressWarnings("unchecked")
	private void extend(int newCapacity) {
		Node<K, V>[] newArray = new Node[newCapacity];
		for (int i = 0; i < nodeArray.length; i++) {
			Node<K, V> n = nodeArray[i];
			while (n != null) {// move each node of this chain to its new bucket
				Node<K, V> next = n.next;
				int index = findIndex(n.hashcode, newCapacity);
				n.next = newArray[index]; // the current head of the new chain follows this node
				newArray[index] = n;
				n = next;
			}
		}
		nodeArray = newArray;
	}

	// Looks up the value mapped to the given key
	public V get(K key) {
		int hash = hash(key.hashCode());
		int index = findIndex(hash, capacity);
		V result = null;
		for (Node<K, V> node = nodeArray[index]; node != null; node = node.next) {
			if (key.equals(node.k)) {
				result = node.v;
				break;
			}
		}
		return result;
	}
}

// Node class
class Node<K, V> {
	int hashcode; // this node's own hash, kept so it can be reused when the table grows
	K k;
	V v;
	Node<K, V> next; // next node in the same bucket

}

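In the end, storage turned out slightly slower than with the remainder method: storing the same one million entries took a little over 2600 milliseconds this time.

Here is the test method: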

public static void main(String args[]) {
		HashTest<String, String> map = new HashTest<String, String>();
		long time = System.currentTimeMillis();
		for (int i = 0; i <= 1000000; i++) {
			map.put("" + i, "" + i);
		}
		long time1 = System.currentTimeMillis();
		System.out.println("Store time: " + (time1 - time));
		long time2 = System.currentTimeMillis();
		String s = map.get("1000000");
		long time3 = System.currentTimeMillis();
		System.out.println("Lookup time: " + (time3 - time2) + " Value found: " + s);
	}

This time, storing one million entries produced the following output (times in milliseconds):

Store time: 2632

Lookup time: 0 Value found: 1000000