The core of the core of the big data solutions -- Map

Discussion:

(too old to reply)

p***@googlemail.com

2015-04-07 12:51:21 UTC

{ edited by mod to shorten lines to ~70 characters. -mod }

Title: The core of the core of the big data solutions -- Map
Author: pengwenwei
Email:
Language: c++
Platform: Windows, linux
Technology: Perfect hash algorithm
Level: Advanced
Description: Map algorithm with high performance
Section MFC c++ map stl
SubSection c++ algorithm
License: (GPLv3)

Download demo project - 1070 Kb
Download source - 1070 Kb

Introduction:
For the c++ program, map is used everywhere.And bottleneck of program
performance is often the performance of map.Especially in the case of
large data,and the business association closely and unable to realize
the data distribution and parallel processing condition.So the
performance of map becomes the key technology.

In the work experience with telecommunications industry and the
information security industry, I was dealing with the big bottom
data,especially the most complex information security industry data,all
can't do without map.

For example, IP table, MAC table, telephone number list, domain name
resolution table, ID number table query, the Trojan horse virus
characteristic code of cloud killing etc..

The map of STL library using binary chop, its has the worst
performance.Google Hash map has the optimal performance and memory at
present, but it has repeated collision probability.Now the big data
rarely use a collision probability map,especially relating to fees,
can't be wrong.

Now I put my algorithms out here,there are three kinds of map,after
the build is Hash map.We can test the comparison,my algorithm has the
zero probability of collision,but its performance is also better than
the hash algorithm, even its ordinary performance has no much
difference with Google.

My algorithm is perfect hash algorithm,its key index and the principle
of compression algorithm is out of the ordinary,the most important is
a completely different structure,so the key index compression is
fundamentally different.The most direct benefit for program is that
for the original map need ten servers for solutions but now I only
need one server.
Declare: the code can not be used for commercial purposes, if for
commercial applications,you can contact me with QQ 75293192.
Download:
https://sourceforge.net/projects/pwwhashmap/files

Applications:
First,modern warfare can't be without the mass of information query,
if the query of enemy target information slows down a second, it could
lead to the delaying fighter, leading to failure of the entire war.
Information retrieval is inseparable from the map, if military products
use pwwhashMap instead of the traditional map,you must be the winner.

Scond,the performance of the router determines the surfing speed, just
replace open source router code map for pwwHashMap, its speed can
increase ten times.
There are many tables to query and set in the router DHCP ptotocol,such
as IP,Mac ,and all these are completed by map.But until now,all map are
using STL liabrary,its performance is very low,and using the Hash map
has error probability,so it can only use multi router packet dispersion
treatment.If using pwwHashMap, you can save at least ten sets of equipment.

Third,Hadoop is recognized as the big data solutions at present,and its
most fundamental thing is super heavy use of the map,instead of SQL
and table.Hadoop assumes the huge amounts of data so that the data is
completely unable to move, people must carry on the data analysis in
the local.But as long as the open source Hadoop code of the map changes
into pwwHashMap, the performance will increase hundredfold without any
problems.

Background to this article that may be useful such as an introduction to
the basic ideas presented:
http://blog.csdn.net/chixinmuzi/article/details/1727195

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Wenwei Peng

2015-06-03 14:43:22 UTC

Permalink

{ edited by mod to shorten lines to ~70 characters. -mod }

Title: The core of the big data solutions -- Map.
Author: pengwenwei
Email: ***@foxmail.com
Language: c++
Platform: Windows, linux
Technology: Perfect hash algorithm
Level: Advanced
Description: A high performance map algorithm
Section MFC c++ map stl
SubSection c++ algorithm
License: (GPLv3)

Map is widely used in c++ programs. Its performance is critical to
programs' performance. Especially in big data and the scenarios which
can't realize data distribution and parallel processing.

I have been working on big data analysis for many years in
telecommunition and information security industry. The data analysis
is so complicated that they can't work without map. Especially in
information security industry, the data is much more complicated than
others. For example, ip table, mac table, telephone numbers table, dns
table etc.

Currently, the STL map and Google's hash map are the most popular maps.
But they have some disadvantages. The STL map is based on binary chop,
which causes a bad performance. Google Hash map has the best performance
at present, but it has probability of collision. For big data analysis,
the collision probability is unacceptable.

Now I would like to publish pwwMap. It includes three different maps
for different scenarios:
1. Memory Map(memMap): It has a good access speed. But its size is
limited by memory size.
2. Harddisk Map(diskMap): It utilizes hard disk to store data. So it
could accept much more data than memory map.
3. Hashmap(hashMap): It has the best performance and a great lookup
speed, but it doesn't have 'insert' and 'delete' functionality.

MemMap and diskMap could be converted to hashMap by function
memMap2HashMap and diskMap2HashMap. According to the test result, my
algorithms' collision probability is zero. About performance, memMap
has a comparable performance with google, and hashMap's performance
is 100 times better than Google's hashmap.

In summary, pwwhash are perfect hash algorithms with zero collision
probability. You can refer to following artical to find the key index
and compress algorithm theory:
http://blog.csdn.net/chixinmuzi/article/details/1727195

Source code and documents:
https://sourceforge.net/projects/pwwhashmap/files/?source=navbar

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

James K. Lowden

2015-06-03 18:56:23 UTC

Permalink

On Wed, 3 Jun 2015 08:43:22 CST

Post by Wenwei Peng
MemMap and diskMap could be converted to hashMap by function
memMap2HashMap and diskMap2HashMap. According to the test result, my
algorithms' collision probability is zero. About performance, memMap
has a comparable performance with google, and hashMap's performance
is 100 times better than Google's hashmap.

Why not use mmap(2) to combine memMap and diskMap? If you use offsets
instead of pointers, the map file could be saved and re-used.

Unless I mistake your meaning, the collision probability of an
algorithm is a mathematical property of the algorithm. It's something
you prove, not test.

--jkl

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Continue reading on narkive:

Search results for 'The core of the core of the big data solutions -- Map' (Questions and Answers)

replies

Which one performs better?AMD Athlon 64 X2 6000+or Intel core 2 duo E6600?

started 2007-04-11 07:50:45 UTC

desktops

replies

a starlight problem for nonchristians(not meaning nonchristians in a bad way)?

started 2009-04-15 15:52:58 UTC

astronomy & space

replies