Recently, as the number and performance of machines in distributed and parallel processing systems have improved, it has become easier to solve computationally intensive tasks with efficient resource utilization and improved performance. As technology advances, repeated system tests are indispensable. Simulation tools are therefore needed in almost all areas of science and the arts, especially for testing and experimenting with ambitious ideas or systems that do not yet exist. Parallel computing was developed from serial computing to overcome the one-task-at-a-time mode of operation and to exploit multitasking. The contemporary approach to improving performance is to partition an application and schedule its tasks to execute concurrently on interconnected nodes. As the number of processes or processors increases, the work done by each one decreases. Alternatively, increasing the sharing among processes increases data usability and speeds up execution.
A simulator for parallel and heterogeneous systems makes it possible to select an appropriate synchronization time for distributed and parallel heterogeneous systems. GEM5 and CPUSim have been proposed to simulate the architecture of standalone computers, and NS3 has been proposed for network performance analysis and simulation. However, these tools cannot be used to simulate parallel and distributed heterogeneous computing systems because they have scalability issues. For instance, GEM5 can run at most two parallel homogeneous systems at a time.
Regarding parallel system simulation, Mohammed Alian et al. proposed the Dist-GEM5 simulation tool to simulate the architecture and network behaviour of parallel and distributed heterogeneous computing systems. Dist-GEM5 creates communication channels through which GEM5 instances installed on standalone devices communicate via the host network, which addresses GEM5's scalability problem on parallel systems. COSSIM can run distributed and parallel heterogeneous system models by integrating a customized GEM5 with a customized OMNET++, which its authors call cGEM5 and cOMNET++, respectively. Anis et al. compared NS2, NS3, OMNET++ and other network simulators; the comparison is given in Table 1. To address the scalability issues of standalone simulators and to use the advantages of NS3 over other network simulators, we propose an NS3 and cGEM5 integrated simulation tool. The proposed simulator uses cGEM5 to simulate architectural structures such as the CPU, memory, storage and input/output of parallel and distributed heterogeneous systems, and uses NS3 to simulate the network devices, connection links, communication protocol suites and network loads of the nodes created by cGEM5.
The integration of the two independent simulators is the most essential part of the integrated simulator. Using the HLA–RTI system for integration isolates operational complexity and makes the integrated simulator easy to control. The HLA–RTI system lets the two simulators operate independently, so the problems and operations of each simulator stay isolated. The HLA–RTI system is an open-source-based standard system, and it is easy to understand and modify. In addition, the HLA–RTI system does not interfere with the measured test results; thus the integrated system is more robust to problems caused by the individual simulators. Distributed system simulators are used to optimize designs and to reduce design mistakes.
Table 1. Performance comparison among network simulators 
The proposed NS3-cGEM5 integrated simulator is used to measure synchronization and to find an appropriate synchronization time. Its results are compared with those of other simulators and found to be comparable. This work is intended to help designers and system engineers who simulate distributed and parallel heterogeneous systems. As computing systems evolve, they become more complex, and distributed and parallel systems become common. Computing systems are now evolving toward newer architectures such as memory-oriented computing and processing in memory. An integrated simulator offers a more efficient, effective and methodical means of simulating parallel and heterogeneous systems than applying multiple separate single-purpose simulators. The contributions of this work are: an alternative integrated simulator for distributed and parallel computing systems; an integration of the existing NS3 with GEM5 that performs better than separate simulators; and fast, flexible and precise simulation of parallel and distributed heterogeneous systems using NS3 features.
The rest of this paper is organized as follows. The next section surveys related work. It is followed by the methodology for the proposed integrated simulator, including details of its implementation. Section 4 presents the experimental tests and results, and Section 5 gives the conclusions and future work.
2. Related Work
Simulators in computing are classified into architecture (processor) simulators and network simulators. Examples of architecture simulators include MikroSim, CPUSim, HASE, Sniper, GEM5 and Zsim. Examples of network simulators, which simulate the communication flow of a system, include NS2, NS3, OMNET, OPNet and OMNET++. The following parts of this section discuss the literature on these architecture, network and improved simulators.
NS3 is a discrete-event network simulator. It is predominantly targeted at researchers working on network communications and at network system courses. NS3 is licensed under the GNU GPLv2 and is freely available to researchers and network system developers. It outlines a model of the working flow of packet data and provides a mechanism for modelling and simulation. Unlike some other network simulators, NS3 can model systems with and without Internet connections, although most researchers use it to model systems without an Internet connection.
NS3 uses two principal languages, C++ and Python, and scripts written in either language can be executed. NetAnim is used to animate and visually display the results. Both languages provide a robust library, so users need little effort to adapt it to their specific needs. For wired technology, NS3 provides models of a simple Ethernet network that uses CSMA/CD as its network protocol. NS3 also provides a set of 802.11 models that offer a precise MAC-level implementation of the 802.11 specifications and a PHY-level 802.11a model. It reduces the simulation memory footprint by allocating no memory for virtual zero-byte values. In NS3, a mobility model is not required for wired devices, because the node positions of the simulated network do not need to be specified for them.
Moreover, NS3 can produce packet trace files for debugging using PCAP (packet capture). Protocol units in NS3 are designed to be nearly the same as those of real computers. Because NS3 is open source, additional resources are supported in its networking software, which reduces the need to rewrite models for simulation. However, NS3 cannot simulate the architectural structure of computing systems.
Table 2. Network simulation tools 
NS3, OMNET++, and OPNet are capable of carrying out large-scale network simulations. Note that, among the simulators listed in Table 1, NS3 is the fastest in terms of computation time. Table 2 compares the network simulators from various points of view. NS3 has a variety of modules, which shows its modularity. NS3 is publicly available for academic and non-academic use. It encourages community contributions to the development of simulation models that are sufficiently realistic for NS3 to be used as a real-time network emulator.
Researchers in different fields may need to test their systems and therefore require a flexible simulation framework that can assess a wide diversity of designs and support rich OS services, including input/output and networking. GEM5 is open-source software with a BSD-based license, and its code is accessible to all researchers without legal limitations.
We selected and compared some familiar architecture simulators and found GEM5 to be the most capable among them. Table 3 shows a comparison of architecture simulation tools.
Table 3. Architecture simulation tools comparison
GEM5 is a combination of M5 and GEMS, both of which have had an impact on the history of architecture simulation. It has various capabilities in which it outperforms other architecture simulators. The GEM5 full-system simulator supports many ISAs with various CPU models, and different applications can also be tested in system-emulation mode. Regarding CPU types, GEM5 acquired detailed CPU modelling from M5; its CPU models are ‘AtomicSimple’, ‘TimingSimple’, ‘InOrder’ and ‘O3’ (out of order). Table 4 summarizes GEM5's simulation capability.
Table 4. GEM5 simulation capability 
Furthermore, GEM5 provides flexible and modular simulation that can evaluate a broad range of systems. It is widely available to all researchers and overcomes the limited modularity and poor code quality of other simulators. This flexibility is achieved by offering a varied set of CPU models, execution modes and memory system models. Fig. 1 compares flexibility and accuracy across design processes. In terms of flexibility, the programmer's design is the most flexible; in terms of accuracy, the RTL representation is the most accurate testing mechanism. GEM5 sits between the programmer's flexibility and RTL accuracy, as shown in Fig. 1. Nonetheless, GEM5 does not support parallel network modelling. To run a full-system simulation of a parallel system, only a pair of similar systems can be simulated, with the same image files, by adding “--dual” to the GEM5 full-system simulation command.
Fig. 1. GEM5 as a flexible tool for architecture simulation
Dist-GEM5 combines two independently developed efforts, pd-GEM5 and multi-GEM5, and is regarded as a distributed version of GEM5. Dist-GEM5 simulates several nodes using multiple simulation systems. It uses TCP sockets as the channel for transferring synchronization and data messages between a switch node and a full-system node, which prevents data messages from bypassing synchronization messages (because TCP delivers packets in strict order).
This simulator improves the checkpointing mechanism of its predecessor, pd-GEM5, and is strongly coupled with the Ethernet protocol. Dist-GEM5 can deliver a fast, scalable and detailed simulation infrastructure for modelling and evaluating large groups of computers.
Obtaining the network from the host system makes it possible to create a parallel distributed system. Server-1 on the first host connects to Client-2 on the next host, and this continues until the last connection is made with the final host. The main shortcomings of Dist-GEM5 are the number of nodes per system and heterogeneity. COSSIM enhanced the network performance of GEM5. Note that Dist-GEM5 is an extension of GEM5; it was not formed by combining GEM5 with any other simulator.
COSSIM provides the necessary hooks for security testing software, making it possible to determine vulnerabilities and inspect the robustness of the system under design. It is the first integrated solution that provides simulation of actual systems of systems, network dynamics and energy aspects. The goal is to offer greater functionality than using each component separately. COSSIM uses GEM5 to simulate various kinds of nodes for general-purpose applications. COSSIM is a system simulator that is cycle-accurate, ISA independent, configurable, able to boot real-world operating systems and capable of executing software compiled for those systems. Furthermore, it uses a dedicated network simulator that handles all network-related modelling from the physical layer of a NIC and beyond; OMNET++ is chosen for this purpose.
In COSSIM, GEM5 and OMNET++ are integrated with the help of the High-Level Architecture (HLA). HLA is a general-purpose software architecture specifically designed for developing and implementing distributed simulation applications; it defines the functional attributes, design rules and interfaces of simulation systems and specifies the communication between individual components. The cGEM5 in COSSIM is a GEM5 customized so that the image file run on it is lightweight and boots quickly. For this reason, the proposed simulator directly uses cGEM5 for distributed and parallel heterogeneous simulation.
Based on the related work above and other references, we designed the proposed simulator to tackle the simulation problems observed in distributed systems and parallel heterogeneous computing systems. We designed the NS3-cGEM5 integrated simulator to test and allocate the synchronization time among simulated nodes. The next section deals with the design approach for the NS3-cGEM5 integration.
3.1 HLA background
The inputs and outputs of an HLA federation depend on the attributes and objects that the federation's designer defines when forming the federation with the HLA-RTI tool. The formation of an HLA-RTI federation is described in detail in ‘Improving the HLA-CERTI framework’. HLA has a variety of RTI implementations, and CERTI is the RTI selected for the proposed integrated simulator. It is an open-source HLA runtime infrastructure that supports the HLA 1.3 specification and uses the C++ and Java programming languages.
3.2 NS3-cGEM5 integration Components
Using either the architecture simulator or the network simulator alone to simulate parallel heterogeneous and distributed systems does not precisely simulate both the communication systems and the architectural structures. The proposed integrated simulator combines a network simulator and an architecture simulator to solve the simulation problem for distributed and parallel heterogeneous systems. The customized GEM5 (cGEM5) is selected for architecture simulation, and NS3 for the network representation and communication facilities of distributed and parallel systems. The comparison tables indicate that NS3 has better features than other network simulators. These NS3 properties, presented in Table 5 and in the previous sections, are the main reasons for selecting it as a component of the proposed integrated simulator. Details of NS3 and GEM5 are given in the related work section.
Table 5. Network simulation tools comparison
3.3 HLA CERTI Architecture
In CERTI, each federate process interacts locally with an RTIA (Run Time Infrastructure Ambassador) through a Unix-domain socket. RTIA processes exchange messages over the network, in particular with the RTIG (RTI Gateway) process, through TCP and/or UDP sockets. A specific role of the RTIA is to satisfy some federate requests immediately, while other requests require sending or receiving network messages. The RTIA manages memory allocation for the message FIFO (first in, first out) queues and always listens to both the federate and the network (RTIG). It plays a significant role in the implementation of the tick function.
The RTIG is a centralization point in the architecture. It manages the creation and destruction of federation executions and the publication and subscription of data. It also plays a crucial role in message broadcasting, which is implemented by an emulated multicast approach: when a message is received from an RTIA, the RTIG delivers it only to the interested RTIAs, avoiding true broadcasting.
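The emulated multicast idea can be illustrated with a small sketch. It is not CERTI code: the `Gateway` class, its method names and the topic strings are hypothetical, and the sketch only assumes that the gateway keeps a subscription table and forwards each message to the subscribed ambassadors other than the sender.

```python
from collections import defaultdict


class Gateway:
    """Toy central gateway that forwards messages only to interested subscribers.

    All names here are illustrative assumptions, not part of the CERTI API.
    """

    def __init__(self):
        # topic -> set of subscribed ambassador ids
        self._subscribers = defaultdict(set)
        # ambassador id -> FIFO list of delivered (topic, payload) pairs
        self.inboxes = defaultdict(list)

    def subscribe(self, ambassador_id, topic):
        self._subscribers[topic].add(ambassador_id)

    def publish(self, sender_id, topic, payload):
        """Deliver payload to every subscriber of topic except the sender."""
        delivered = []
        for ambassador_id in sorted(self._subscribers[topic]):
            if ambassador_id != sender_id:
                self.inboxes[ambassador_id].append((topic, payload))
                delivered.append(ambassador_id)
        return delivered
```

With three ambassadors of which only two subscribe to a topic, a message published by one of them reaches only the other subscriber; the third ambassador's inbox stays empty, which is the behaviour that distinguishes emulated multicast from true broadcasting.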
HLA is a standard for distributed simulation, used when creating a simulator by combining (federating) several simulators. HLA was developed in the 1990s under the US Department of Defense and later became an open international IEEE standard.
In general, the independent nodes created in cGEM5 communicate through HLA, with NS3 providing the network communication. These cGEM5 nodes have different architectural behaviour (heterogeneity), and they need synchronization to carry out harmonized tasks. The transaction flow of the proposed simulator is given in Fig. 2, and its detailed communication structure is depicted in Fig. 5.
Fig. 2. The transaction flow and federation integration of the proposed simulator
After the nodes are created, the paired nodes are synchronized. To obtain a synchronized simulation, we allocate a waiting time until all federates involved in the created federation are ready to communicate, and the same pattern of execution is repeated periodically with a time step of Δt.
Fig. 3. Synchronization addition in periodic federate scheme
During each time step of the repeated execution, a federate goes through four phases: reception, computation, transmission and slack time. It is important to add an explicit synchronization phase to ensure a globally coherent run time for the whole simulation.
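The periodic scheme can be made concrete with a short sketch. It assumes that the four phases run in the order listed, that the slack phase takes whatever remains of the step, and that synchronization happens at the step boundary; the function name and the millisecond values in the usage note are hypothetical and are not taken from the experiments.

```python
def plan_time_step(step_index, dt_ms, reception_ms, computation_ms, transmission_ms):
    """Lay out the four phases of one periodic step (integer milliseconds).

    Assumption: phases run back to back and slack fills the rest of the step.
    """
    busy_ms = reception_ms + computation_ms + transmission_ms
    if busy_ms > dt_ms:
        raise ValueError("phases do not fit in one time step")
    start = step_index * dt_ms
    return {
        "reception_start": start,
        "computation_start": start + reception_ms,
        "transmission_start": start + reception_ms + computation_ms,
        "slack_start": start + busy_ms,
        "slack_ms": dt_ms - busy_ms,
        # Synchronization point assumed to coincide with the next step boundary.
        "next_step_start": start + dt_ms,
    }
```

For example, `plan_time_step(2, 100, 10, 50, 20)` describes the third step of a 100 ms period in which 80 ms are busy, leaving 20 ms of slack before the next step begins at 300 ms.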
Fig. 4. Synchronization of NS3-cGEM5 integration for distributed system
In the proposed NS3-cGEM5 integration, if we consider four independent cGEM5 nodes (N1, N2, N3, N4), the NS3 simulation node can be regarded as an additional independent node, giving a total of five independent nodes (see Fig. 4) with different arrival times for synchronization. The synchronization server (SynchServer) creates a waiting point to make sure that all five nodes have arrived. It waits for the specified synch time until all nodes are available. When it is informed that all nodes have arrived, it releases the nodes for execution. It then waits for the same period for the next synchronization and execution. The synchronization time (synch time) of the system is set by the user.
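The waiting point described above behaves like a reusable barrier. The following sketch models it with Python's standard `threading.Barrier`; it is an analogy under stated assumptions and not the SynchServer implementation. The function name `run_synchronized`, the use of threads to stand in for simulator nodes, and the treatment of the synch time as a wait timeout are all assumptions made for illustration.

```python
import threading


def run_synchronized(node_names, rounds, synch_time_s):
    """Release all nodes together once each has arrived, for `rounds` rounds.

    Threads stand in for simulator nodes (an assumption for illustration).
    """
    barrier = threading.Barrier(len(node_names))
    log = []  # (round, node) entries appended after each release
    log_lock = threading.Lock()

    def node(name):
        for r in range(rounds):
            # Wait for the other nodes; a timeout breaks the barrier and
            # raises threading.BrokenBarrierError in the waiting threads.
            barrier.wait(timeout=synch_time_s)
            with log_lock:
                log.append((r, name))

    threads = [threading.Thread(target=node, args=(n,)) for n in node_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Calling `run_synchronized(["N1", "N2", "N3", "N4", "NS3"], 3, 5.0)` returns a log in which no node records round 1 before all five have recorded round 0, mirroring the release-then-wait cycle of the five nodes in Fig. 4.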
3.5 Federation creation
In the federation creation process, one of the federates is responsible for creating the federation. That federate then joins the federation itself. After this, the next federate joins the federation, and the predetermined task is executed. Once execution finishes, the federate that joined last is released, and the creator then destroys the federation. In the NS3-cGEM5 integration, NS3 is the federation creator; cGEM5 joins the federation and is released first after completing the task assigned to the federation.
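The ordering of these steps can be sketched as a small state model. The `Federation` class and its method names are hypothetical and do not reproduce the HLA or CERTI API; the sketch also assumes that the creator resigns before destroying the federation, which the description above does not state explicitly.

```python
class Federation:
    """Toy model of the create / join / resign / destroy ordering (hypothetical API)."""

    def __init__(self, name, creator):
        self.name = name
        self.creator = creator
        self.members = []
        self.alive = True
        self.events = [("create", creator)]

    def join(self, federate):
        if not self.alive:
            raise RuntimeError("federation no longer exists")
        self.members.append(federate)
        self.events.append(("join", federate))

    def resign(self, federate):
        self.members.remove(federate)
        self.events.append(("resign", federate))

    def destroy(self, federate):
        if federate != self.creator:
            raise PermissionError("only the creator destroys the federation")
        if self.members:
            raise RuntimeError("federates are still joined")
        self.alive = False
        self.events.append(("destroy", federate))


def run_lifecycle():
    """NS3 creates and joins, cGEM5 joins, then cGEM5 is released first."""
    fed = Federation("NS3-cGEM5", creator="NS3")
    fed.join("NS3")
    fed.join("cGEM5")
    # ... the predetermined task would execute here ...
    fed.resign("cGEM5")
    fed.resign("NS3")  # assumption: the creator resigns before destroying
    fed.destroy("NS3")
    return fed.events
```

The guard in `destroy` makes the ordering explicit: the federation cannot be removed while any federate is still joined, so the last-joined federate must be released before the creator tears the federation down.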
In Fig. 5, ambassadors play an important role in the HLA system. There are two kinds, the federate ambassador and the RTI ambassador, and they sit between HLA and the two federates. Communication among the client nodes, the server node and HLA is established with the help of these ambassadors. Calls from HLA to the federates pass through the federate ambassador, and calls from the federates back to HLA pass through the RTI ambassador. In other words, the federates reach the RTIG through the RTI ambassador, whereas the RTIG reaches the federates through the federate ambassador. Fig. 5 depicts these transactions in detail.
Fig. 5. Detailed node communication that shows the synchronization of NS3-cGEM5 integration for a distributed system. Ambassadors are responsible for communications between federates and HLA, and TCP is used for communications between NS3 nodes.
4. Experimental Tests & Results
In the integration of NS3-cGEM5, we used the following platform: a desktop with an Intel® Core™ i5-3570K CPU @ 3.40 GHz, Ubuntu Linux 14.04, 6 GB of RAM, 1 TB of storage, NS3 (ns-allinone-3.19), and cGEM5 from COSSIM. After setting up the platform, we ran the following experiments and obtained the results reported below.
To measure and compare the performance of the proposed integrated simulator, the following metrics are defined and used throughout this paper:
• Booting time: the time taken for each node created by the NS3-cGEM5 simulator to be ready to operate after the execution command has been issued.
• Federation (Fed.) time: the very short time that it takes for nodes from the two separate simulators, NS3 and cGEM5, to create a unified node through the HLA-RTI system and be ready to boot.
• Synchronization (synch) time: in the HLA system, the time allotted for each federate to update and synchronize with the system.
• Starting time: the maximum time taken for the nodes created by the NS3-cGEM5 simulator to be ready to operate after the execution command has been issued. Nodes may have different starting times depending on their ISA and other factors; in a parallel system, the starting time determines when system communication can begin.
4.1 Experimental Tests
We set up the NS3-cGEM5 integration and performed experimental tests of the synchronization time. In HLA, the synchronization time defines when each federate updates and synchronizes with the system. It is set by the user, together with the architecture to be run on the integrated simulator. The primary objective of this test is to determine the impact of synchronization time on booting time and to find the optimal synch time, that is, the one giving the best synchronization and the minimum booting time.
Table 6. Experimental test result for synch time vs. booting time test
In these experimental tests, we allocated various synchronization times and measured the federation time and booting time. From the measured values, we calculated the average booting time and determined the total time required to boot a system. Note that we take the slowest booting time plus the federation time as the minimum booting time of the system, because all architectures must have started for the system to be up.
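The derived quantities can be written down as a short calculation. The function and key names are hypothetical, and the numbers in the usage note are placeholders chosen for easy arithmetic; they are not measured values from Table 6.

```python
from statistics import mean


def summarize_boot(booting_times_s, federation_time_s):
    """Derive system-level times from per-node booting times (seconds).

    booting_times_s maps a node label (e.g. an ISA name) to its booting time.
    """
    slowest = max(booting_times_s.values())
    return {
        # Average over all nodes of the simulated system.
        "average_booting_time_s": mean(booting_times_s.values()),
        # Starting time: the maximum time for a node to become ready.
        "starting_time_s": slowest,
        # Minimum system booting time: slowest node plus federation time.
        "minimum_system_booting_time_s": slowest + federation_time_s,
    }
```

With placeholder inputs such as `summarize_boot({"ARM": 60, "x86": 90}, 10)`, the slowest node (90 s) sets the starting time, and adding the 10 s federation time gives a minimum system booting time of 100 s.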
We varied the synchronization time from 10 to 100,000 ms, which means that the nodes update the SynchServer at these intervals. In most simulators, 10 ms is the default synchronization time. Based on our measurements, we discuss the results as follows.
The experimental test focused on the synchronization of a heterogeneous system combining x86 and ARM. Synchronization time is vital for a distributed system. Fig. 6 shows how the federation time varies as the synch time goes from 10 ms to 100,000 ms. It shows that increasing the synchronization time does not affect the federation time, which stays at nearly 10 s throughout. The federation time therefore has a negligible impact on booting time and is not a means of reducing it. Fig. 7 depicts the booting time of the ARM and x86 architectures as a function of synchronization time. In Fig. 7, we observe a reduction in booting time near a synch time of 10,000 ms. This result shows that repeated experiments and measurements are required to determine the synchronization time. The synchronization time was varied for both the x86 and ARM operating system images. The ARM operating system image is generally expected to boot faster than the x86 image. The experimental results confirmed this expectation, and one particular synchronization time (10,000 ms in Fig. 7) gave a much shorter booting time than all other synch times. We can therefore consider 10,000 ms the optimal synchronization time with respect to booting time.
Fig. 6. Federation time as a function of synch time.
Fig. 7. ARM and x86 booting time vs. synch time
Based on our measurements, we also plotted the federation time, average total booting time and starting time against synchronization time for x86 and ARM. Fig. 8 gives a collective view of the measured times. It shows that the starting time of the integrated system simulation increases significantly for synchronization times of 100,000 ms and above.
In the following analysis, we focus on synchronization times between 10 and 10,000 ms and set aside the results for 100,000 ms and above for future analysis, because such synchronization times are too large and impractical for a real simulation run. Within the window of interest, 100 ms is the more appropriate synchronization time from the starting-time point of view, and 10,000 ms from the average-total-booting-time point of view. The federation time has a negligible impact compared with booting and starting time, because it is small and almost constant across synch times.
Fig. 8. Federation time, average total booting time and starting time as a function of synch time
We compared the proposed simulator with COSSIM, and the results show nearly similar behaviour, especially at lower synchronization times. Table 7 shows the measured variation in booting time as a function of synchronization time for COSSIM and the proposed simulator, and Fig. 9 depicts the comparison between the two simulators.
Table 7. Starting time as a function of synchronization time in COSSIM and NS3-cGEM5 integrated simulators for ARM and x86 architectures
Fig. 9. Starting time as a function of synchronization time in COSSIM and NS3-cGEM5 integrated simulators
5. Conclusion and Future work
To tackle the challenges of synchronization time allocation in the integrated simulation of distributed and parallel heterogeneous systems, we have proposed an integrated simulator composed of NS3 and cGEM5 for appropriate synchronization time allocation.
In our tests, we simulated heterogeneous and homogeneous systems with varying synchronization times to determine the booting time. We showed that the integrated simulator is more capable of simulating heterogeneous systems. We also showed that repeated simulation tests are required to determine an appropriate synchronization time parameter for better and faster simulation.
From our measurements, we can conclude that distributed system simulation is not easily achieved with a single-system simulator such as GEM5 or NS3, which simulate only architecture and only network systems, respectively. With the proposed NS3-cGEM5 integrated simulator, we performed various architectural simulations (including homogeneous and heterogeneous systems) and tested the effect of synchronization time on the booting and execution time of systems coupled in parallel.
In general, we conclude that the proposed simulator is capable of allocating an appropriate synchronization time for homogeneous and heterogeneous systems. For parallel processes and distributed systems, the synchronization server, which depends on the synchronization time, is vital for achieving the synchronization needed for harmonized system communication in distributed system simulation. Our test results reveal that the minimum synchronization time does not always correspond to the fastest booting time; smaller booting times and faster starting times occur at intermediate synchronization times. As a result, selecting the synchronization time parameter for the integrated simulator requires repeated tests to decide the optimal value.
While testing and measuring the integrated simulator, we found an unexpected increase in booting time in some nodes at synch times over 100,000 ms. Even though a synch time of 100,000 ms is not practical in a real simulation, the reason for and mechanism of this behaviour will be analyzed in a future study.
This work was supported by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2018-0-00503, Researches on next generation memory-centric computing system architecture).